Abstract
The TREX is a recently introduced approach to sparse linear regression. In contrast to most well-known approaches to penalized regression, the TREX can be formulated without the use of tuning parameters. In this paper, we establish the first known prediction error bounds for the TREX. Additionally, we introduce extensions of the TREX to a more general class of penalties, and we provide a bound on the prediction error in this generalized setting. These results deepen the understanding of the TREX from a theoretical perspective and provide new insights into penalized regression in general.
References
Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79
Arlot S, Celisse A (2011) Segmentation of the mean of heteroscedastic data via cross-validation. Stat Comput 21(4):613–632
Bach F (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th international conference on machine learning, pp 33–40
Baraud Y, Giraud C, Huet S (2009) Gaussian model selection with an unknown variance. Ann Stat 37(2):630–672
Barber R, Candès E (2015) Controlling the false discovery rate via knockoffs. Ann Stat 43(5):2055–2085
Belloni A, Chernozhukov V, Wang L (2011) Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4):791–806
Bickel P, Ritov Y, Tsybakov A (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Bien J, Gaynanova I, Lederer J, Müller C (2018) Non-convex global minimization and false discovery rate control for the TREX. J Comput Graph Stat 27(1):23–33. https://doi.org/10.1080/10618600.2017.1341414
Boucheron S, Lugosi G, Massart P (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin
Bunea F, Lederer J, She Y (2014) The group square-root lasso: theoretical properties and fast algorithms. IEEE Trans Inf Theory 60(2):1313–1325
Bunea F, Tsybakov A, Wegkamp M (2006) Aggregation and sparsity via \(\ell _1\)-penalized least squares. In: Proceedings of 19th annual conference on learning theory, pp 379–391
Candès E, Plan Y (2009) Near-ideal model selection by \(\ell _1\) minimization. Ann Stat 37(5):2145–2177
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Chatterjee S, Jafarov J (2015) Prediction error of cross-validated lasso. arXiv:1502.06291
Chételat D, Lederer J, Salmon J (2017) Optimal two-step prediction in regression. Electron J Stat 11(1):2519–2546
Chichignoud M, Lederer J, Wainwright M (2016) A practical scheme and fast algorithm to tune the lasso with optimality guarantees. J Mach Learn Res 17:1–20
Combettes P, Müller C (2016) Perspective functions: proximal calculus and applications in high-dimensional statistics. J Math Anal Appl 457(2):1283–1306
Dalalyan A, Tsybakov A (2012) Mirror averaging with sparsity priors. Bernoulli 18(3):914–944
Dalalyan A, Tsybakov A (2012) Sparse regression learning by aggregation and Langevin Monte-Carlo. J Comput Syst Sci 78(5):1423–1443
Dalalyan A, Hebiri M, Lederer J (2017) On the prediction performance of the lasso. Bernoulli 23(1):552–581
Dalalyan A, Tsybakov A (2007) Aggregation by exponential weighting and sharp oracle inequalities. In: Proceedings of 19th annual conference on learning theory, pp 97–111
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Giraud C, Huet S, Verzelen N (2012) High-dimensional regression with unknown variance. Stat Sci 27(4):500–518
Hebiri M, Lederer J (2013) How correlations influence lasso prediction. IEEE Trans Inf Theory 59(3):1846–1854
Huang C, Cheang G, Barron A (2008) Risk of penalized least squares, greedy selection and L1 penalization for flexible function libraries. Manuscript
Koltchinskii V, Lounici K, Tsybakov A (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann Stat 39(5):2302–2329
Lederer J, van de Geer S (2014) New concentration inequalities for empirical processes. Bernoulli 20(4):2020–2038
Lederer J, Müller C (2014) Topology adaptive graph estimation in high dimensions. arXiv:1410.7279
Lederer J, Müller C (2015) Don’t fall for tuning parameters: tuning-free variable selection in high dimensions with the TREX. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence
Lederer J, Yu L, Gaynanova I (2016) Oracle inequalities for high-dimensional prediction. arXiv:1608.00624
Lim N, Lederer J (2016) Efficient feature selection with large and high-dimensional data. arXiv:1609.07195
Massart P, Meynet C (2011) The Lasso as an \(\ell _1\)-ball model selection procedure. Electron J Stat 5:669–687
Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Ser B 72(4):417–473
Raskutti G, Wainwright M, Yu B (2010) Restricted eigenvalue properties for correlated Gaussian designs. J Mach Learn Res 11:2241–2259
Rigollet P, Tsybakov A (2011) Exponential screening and optimal rates of sparse estimation. Ann Stat 39(2):731–771
Sabourin J, Valdar W, Nobel A (2015) A permutation approach for selecting the penalty parameter in penalized model selection. Biometrics 71(4):1185–1194
Shah R, Samworth R (2013) Variable selection with error control: another look at stability selection. J R Stat Soc Ser B 75(1):55–80
Sun T, Zhang CH (2012) Scaled sparse linear regression. Biometrika 99(4):879–898
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the lasso. Electron J Stat 3:1360–1392
van de Geer S, Lederer J (2013) The Bernstein-Orlicz norm and deviation inequalities. Probab Theory Relat Fields 157(1–2):225–250
van de Geer S, Lederer J (2013) The Lasso, correlated design, and improved oracle inequalities. IMS Collections 9:303–316
van der Vaart A, Wellner J (1996) Weak convergence and empirical processes. Springer, Berlin
van de Geer S (2007) The deterministic lasso. In: Joint statistical meetings proceedings
van de Geer S (2000) Empirical processes in M-estimation. Cambridge University Press, Cambridge
Wainwright M (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (lasso). IEEE Trans Inf Theory 55(4):2183–2202
Wellner J (2017) The Bennett-Orlicz norm. Sankhya A 79(2):355–383
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942
Zhuang R, Lederer J (2017) Maximum regularized likelihood estimators: a general prediction theory and applications. arXiv:1710.02950
Acknowledgements
We thank the editor and the reviewers for their insightful comments.
Appendices
Appendix A: Proof of a generalization of Theorem 1
We first consider a generalization of Assumption 1.
Assumption 4
The signal \(\beta ^*\) is sufficiently small such that for some \(\kappa _{1}>1\) and \(\kappa _{2}>2\) with \(1/\kappa _1 + 2/\kappa _2 <1\),
As a first step toward the proof of Theorem 1, we show that any TREX solution has an \(\ell _{1}\)-norm at least as large as that of any LASSO solution with tuning parameter \(\lambda =\hat{u}\).
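For orientation, we recall the form of the two estimators compared here; the display below is only a sketch and may differ from Eqs. (2) and (4) in the main text in its normalization constants (cf. Tibshirani 1996; Lederer and Müller 2015):
\[\tilde{\beta }(\lambda )\in \mathop {\mathrm{arg\,min}}\limits _{\beta \in \mathbb {R}^p}\Bigl \{\Vert Y-X\beta \Vert _2^2+2\lambda \Vert \beta \Vert _1\Bigr \},\qquad \hat{\beta }\in \mathop {\mathrm{arg\,min}}\limits _{\beta \in \mathbb {R}^p}\Bigl \{\frac{\Vert Y-X\beta \Vert _2^2}{c\Vert X^\top (Y-X\beta )\Vert _\infty }+\Vert \beta \Vert _1\Bigr \},\]
with \(c\in (0,2)\) and \(\hat{u}:=\Vert X^\top (Y-X\hat{\beta })\Vert _\infty \).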
Lemma 3
Any TREX solution (4) satisfies
\[\Vert \hat{\beta }\Vert _{1}\ \ge \ \Vert \tilde{\beta }(\hat{u})\Vert _{1},\]
where \(\tilde{\beta }(\hat{u})\) is any LASSO solution as in (2) with tuning parameter \(\lambda =\hat{u}\).
Proof (of Lemma 3)
If \(\tilde{\beta }(\hat{u})=0\), the statement holds trivially. Now for \(\tilde{\beta }(\hat{u})\ne 0\), the KKT conditions for LASSO imply that
Together with the definition of \(\hat{\beta }\), this yields
On the other hand, the definition of the LASSO implies
Combining these two displays gives us
The claim now follows from \(c< 2\).
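The proof of Theorem 4 below also uses the standard fact that the \(\ell _1\)-norm of a LASSO solution is non-increasing in the tuning parameter. Under the indicative normalization sketched before Lemma 3 (the conclusion does not depend on this normalization), this follows from comparing the optimality of \(\tilde{\beta }(\lambda )\) and \(\tilde{\beta }(\lambda ')\) at their respective tuning parameters \(\lambda \le \lambda '\):
\[\Vert Y-X\tilde{\beta }(\lambda )\Vert _2^2+2\lambda \Vert \tilde{\beta }(\lambda )\Vert _1\le \Vert Y-X\tilde{\beta }(\lambda ')\Vert _2^2+2\lambda \Vert \tilde{\beta }(\lambda ')\Vert _1,\]
\[\Vert Y-X\tilde{\beta }(\lambda ')\Vert _2^2+2\lambda '\Vert \tilde{\beta }(\lambda ')\Vert _1\le \Vert Y-X\tilde{\beta }(\lambda )\Vert _2^2+2\lambda '\Vert \tilde{\beta }(\lambda )\Vert _1.\]
Adding these two inequalities and canceling the quadratic terms gives \((\lambda '-\lambda )\bigl (\Vert \tilde{\beta }(\lambda ')\Vert _1-\Vert \tilde{\beta }(\lambda )\Vert _1\bigr )\le 0\), so that \(\lambda '\ge \lambda \) implies \(\Vert \tilde{\beta }(\lambda ')\Vert _1\le \Vert \tilde{\beta }(\lambda )\Vert _1\).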
We are now ready to prove a generalization of Theorem 1.
Theorem 4
Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Then, for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies
Theorem 1 follows from Theorem 4 by setting \(\kappa _{1}=2\) and \(\kappa _{2}=8\).
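These choices indeed satisfy the explicit constraints of Assumption 4, since
\[\kappa _1=2>1,\qquad \kappa _2=8>2,\qquad \frac{1}{\kappa _1}+\frac{2}{\kappa _2}=\frac{1}{2}+\frac{1}{4}=\frac{3}{4}<1.\]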
Proof (of Theorem 4)
Assume first \(\tilde{\beta }(\tilde{\lambda })=0\). Then, since \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the definition of the TREX implies
Assume now \(\tilde{\beta }(\tilde{\lambda })\ne 0\). In view of the KKT conditions for LASSO, \(\tilde{\beta }(\tilde{\lambda })\) fulfills
The definition of the TREX therefore yields
We now observe that since \(\kappa _1> 1\) by assumption, we have \(\tilde{\lambda }\ge \hat{u}\), and one can easily verify that this implies \(\Vert \tilde{\beta }(\hat{u})\Vert _{1}\ge \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\). At the same time, Lemma 3 ensures \(\Vert \hat{\beta }\Vert _1\ge \Vert \tilde{\beta }(\hat{u})\Vert _1\). Thus, \(c\hat{u}\Vert \hat{\beta }\Vert _{1}\ge c\hat{u}\Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\) and, therefore, we find again
Invoking the model, this results in
We can now use Hölder’s inequality and the triangle inequality to deduce
Next, we observe that by the definition of our estimator \(\hat{\beta }\) and of \(\tilde{\lambda }\),
Combining these two displays and using the model assumption then gives
We can now use Hölder’s inequality and rearrange the terms to get
The last step is to use Assumption 4, which ensures
and, therefore,
The above display therefore yields
Using the triangle inequality, we finally obtain
as desired.
Corollary 4
Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Furthermore, let \(\kappa _1,\kappa _2>0\) be such that
Then, for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies
where \(\nu \) is the compatibility constant defined in (5).
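For concreteness, one standard formulation of a compatibility constant over the support \(S:=\{j:\beta ^*_j\ne 0\}\) is (Bühlmann and van de Geer 2011)
\[\nu ^2:=\min \Bigl \{\frac{|S|\,\Vert X\delta \Vert _2^2}{n\Vert \delta _S\Vert _1^2}\,:\,\delta \ne 0,\ \Vert \delta _{S^c}\Vert _1\le 3\Vert \delta _S\Vert _1\Bigr \},\]
where \(\delta _S\) and \(\delta _{S^c}\) denote the restrictions of \(\delta \) to \(S\) and to its complement; the normalization and the cone constant in (5) may differ from this sketch.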
Corollary 3 follows from Corollary 4 by setting \(\kappa _{1}=2\) and \(\kappa _{2}=8\), which satisfy the requirement for any \(c\in (0,2)\).
Proof (of Corollary 4)
Using Theorem 4 and the definition of \(\tilde{\lambda }\), the TREX prediction loss satisfies
On the other hand, the LASSO estimator \(\tilde{\beta }(\tilde{\lambda })\) satisfies (Bühlmann and van de Geer 2011, Theorem 6.1)
Since by assumption
the LASSO bound implies
as desired.
Appendix B: Proof of a generalization of Theorem 2
We consider a generalization of the TREX according to
\[\hat{\beta }\in \mathop {\mathrm{arg\,min}}\Bigl \{\frac{\Vert Y-X\beta \Vert _{2}^{2}}{c\,\Omega ^{*}\bigl (X^{\top }(Y-X\beta )\bigr )}+\Omega (\beta )\Bigr \},\]
where \(\Omega \) is a norm on \(\mathbb {R}^p\), \(\Omega ^{*}(\eta ):=\sup \{\eta ^{\top }\beta :\Omega (\beta )\le 1\}\) is the dual of \(\Omega \), \(0<c<2\), and the minimum is taken over all \(\beta \in \mathbb {R}^p\). We also set \(\hat{u}:=\Omega ^{*}(X^{\top }(Y-X\hat{\beta }))\) with some abuse of notation. The corresponding generalization of Assumption 2 then reads as follows.
Assumption 5
The regression vector \(\beta ^*\) is sufficiently large such that
We now prove a generalization of Theorem 2.
Theorem 5
Let Assumption 5 be fulfilled. If \(\hat{u}\le \Omega ^{*}(X^{\top }Y)\), then the prediction loss of the TREX satisfies
Theorem 2 follows from Theorem 5 by setting \(\Omega (\cdot ):=\Vert \cdot \Vert _{1}.\)
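Indeed, for \(\Omega (\cdot )=\Vert \cdot \Vert _{1}\), the dual norm defined above is the \(\ell _\infty \)-norm,
\[\Omega ^{*}(\eta )=\sup \{\eta ^{\top }\beta :\Vert \beta \Vert _1\le 1\}=\Vert \eta \Vert _\infty ,\]
so that \(\Omega ^{*}(X^{\top }\varepsilon )=\Vert X^{\top }\varepsilon \Vert _\infty \) and \(\hat{u}\) coincides with the quantity used in Appendix A. More generally, the definition of \(\Omega ^{*}\) directly yields the dual-norm form of Hölder's inequality, \(\eta ^{\top }\beta \le \Omega ^{*}(\eta )\Omega (\beta )\), which is used repeatedly in the proof below.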
Proof (of Theorem 5)
The definition of the estimator implies
which yields together with the model assumptions
Rearranging the terms and using Hölder's inequality in the form of \(2\varepsilon ^\top X\hat{\beta }\le 2\Omega ^*(X^\top \varepsilon )\Omega (\hat{\beta })\) and \(\Vert X\beta ^*\Vert ^{2}_{2}=\beta ^*{}^{\top }X^{\top }X\beta ^*\le \Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)\) then gives
Case 1: We first consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\).
In this case, we first note that \(\hat{u} \le \Omega ^{*}(X^{\top }Y)\) by assumption. Using this and \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\) allows us to remove the first two terms on the right-hand side of Inequality (9), so that
Since \(2\varepsilon ^\top X\beta ^*\le 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)\) due to Hölder’s Inequality, and since \(\hat{u}\le \Omega ^*(X^\top Y)\) by definition of our estimator, we therefore obtain from the above display and the model assumptions
Next, we note that the triangle inequality gives
Plugging this into the previous display finally yields
which concludes the proof for Case 1.
Case 2: We now consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\ge c\hat{u}\).
As before, we start with the definition of the estimator, which yields in particular
Invoking the model assumptions and Hölder’s inequality then gives
We can now plug this into Inequality (9) to obtain
Rearranging the terms then gives
We now observe that Assumption 5 implies via the triangle inequality and the model assumptions that
Using this, Hölder’s inequality, and the triangle inequality, we then find
This concludes the proof for Case 2 and, therefore, the proof of Theorem 5.
Appendix C: Prediction error results for \(p=128\)
See Fig. 2.