Abstract
We congratulate the authors on their stimulating contribution to the burgeoning high-dimensional inference literature. The bootstrap offers such an attractive methodology in these settings, but it is well-known that its naive application in the context of shrinkage/superefficiency is fraught with danger (e.g. Samworth in Biometrika 90:985–990, 2003; Chatterjee and Lahiri in J Am Stat Assoc 106:608–625, 2011). The authors show how these perils can be elegantly sidestepped by working with de-biased, or de-sparsified, versions of estimators. In this discussion, we consider alternative approaches to individual and simultaneous inference in high-dimensional linear models, and retain the notation of the paper.
1 Why penalise coefficients of variables of interest?
Suppose that for some, presumably small, set \(G \subseteq \{1,\ldots ,p\}\), we want a confidence set for \(\beta _G^0\). Much of the recent literature, including the paper under discussion, proceeds by constructing an initial estimator, such as the Lasso estimator \(\hat{\beta }\), and then attempting to de-bias it. Our starting point is the following provocative question: since we know in advance the set of variables we are interested in, why would we want to penalise these coefficients in the first place? Of course, it is standard practice not to penalise the intercept term in high-dimensional linear models, to preserve location equivariance, but we now consider taking this one stage further. More precisely, consider the linear model
$$\begin{aligned} Y = \mathbf {X}\beta ^0 + \epsilon = \mathbf {X}_G\beta _G^0 + \mathbf {X}_{-G}\beta _{-G}^0 + \epsilon , \end{aligned}$$
where the columns of \(\mathbf {X}\) have Euclidean length \(n^{1/2}\), where \(\mathbf {X}_G^T\mathbf {X}_G\) is positive definite, and where, for simplicity, we assume that \(\epsilon \sim N_n(0,\sigma ^2I)\). We further assume that the set \(S := \{j : \beta _j^0 \ne 0\}\) of signal variables has cardinality s, and let \(N := \{1,\ldots ,p\} \setminus S\). For \(\lambda > 0\), let
$$\begin{aligned} (\hat{\beta }_G,\hat{\beta }_{-G}) \in \mathop {\mathrm {arg\,min}}\limits _{(\beta _G,\beta _{-G}) \in \mathbb {R}^{|G|}\times \mathbb {R}^{p-|G|}} \biggl \{\frac{1}{2n}\Vert Y - \mathbf {X}_G\beta _G - \mathbf {X}_{-G}\beta _{-G}\Vert _2^2 + \lambda \Vert \beta _{-G}\Vert _1\biggr \}, \end{aligned}$$
(1)
where we emphasise that \(\beta _G\) is unpenalised. For fixed \(\beta _{-G} \in \mathbb {R}^{p-|G|}\), the solution in the first argument is given by ordinary least squares:
$$\begin{aligned} \hat{\beta }_G(\beta _{-G}) = (\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T(Y - \mathbf {X}_{-G}\beta _{-G}). \end{aligned}$$
We therefore find that
$$\begin{aligned} \hat{\beta }_{-G} \in \mathop {\mathrm {arg\,min}}\limits _{\beta _{-G} \in \mathbb {R}^{p-|G|}} \biggl \{\frac{1}{2n}\Vert (I-P_G)(Y - \mathbf {X}_{-G}\beta _{-G})\Vert _2^2 + \lambda \Vert \beta _{-G}\Vert _1\biggr \}, \end{aligned}$$
where \(P_G := \mathbf {X}_G(\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T\) denotes the matrix representing the orthogonal projection onto the column space of \(\mathbf {X}_G\). In other words, \(\hat{\beta }_{-G}\) is simply the Lasso solution with response and design matrix pre-multiplied by \((I-P_G)\). Moreover,
$$\begin{aligned} \hat{\beta }_G = (\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T(Y - \mathbf {X}_{-G}\hat{\beta }_{-G}). \end{aligned}$$
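To make this reduction concrete, here is a minimal numerical sketch (the function name is ours, purely illustrative): it computes \(\hat{\beta }_{-G}\) by running an off-the-shelf Lasso solver on the projected data \((I-P_G)Y\) and \((I-P_G)\mathbf {X}_{-G}\), and then recovers \(\hat{\beta }_G\) by ordinary least squares. We assume scikit-learn's Lasso, whose objective \(\frac{1}{2n}\Vert y - Xw\Vert _2^2 + \alpha \Vert w\Vert _1\) matches (1) with \(\alpha = \lambda \).

```python
import numpy as np
from sklearn.linear_model import Lasso

def partially_penalised_lasso(X, Y, G, lam):
    """Solve (1): penalise only the coefficients outside G.

    Returns (beta_hat_G, beta_hat_minus_G)."""
    n, p = X.shape
    G = np.asarray(G)
    notG = np.setdiff1d(np.arange(p), G)
    XG, XmG = X[:, G], X[:, notG]
    # Orthonormal basis Q for col(X_G), so that P_G v = Q (Q^T v)
    Q, _ = np.linalg.qr(XG)
    resid = lambda v: v - Q @ (Q.T @ v)  # applies (I - P_G)
    # Lasso on the projected response/design gives beta_hat_{-G}
    lasso = Lasso(alpha=lam, fit_intercept=False)
    lasso.fit(resid(XmG), resid(Y))
    beta_mG = lasso.coef_
    # Ordinary least squares of the partial residual on X_G gives beta_hat_G
    beta_G = np.linalg.solve(XG.T @ XG, XG.T @ (Y - XmG @ beta_mG))
    return beta_G, beta_mG
```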
For our theoretical analysis of \(\hat{\beta }_G\), we will require the following compatibility condition:
- (A1): There exists \(\phi _0 > 0\) such that for all \(b \in \mathbb {R}^{p-|G|}\) with \(\Vert b_N\Vert _1 \le 3\Vert b_S\Vert _1\), we have
$$\begin{aligned} \Vert b_S\Vert _1^2 \le \frac{s\Vert (I-P_G)\mathbf {X}_{-G}b\Vert _2^2}{n\phi _0^2}. \end{aligned}$$
The theorem below is only a small modification of existing results in the literature (e.g. Bickel et al. 2009), but for completeness we provide a proof in the Appendix.
Theorem 1
Assume (A1), and let \(\lambda := A\sigma \sqrt{\frac{\log p}{n}}\). Then with probability at least \(1 - p^{-(A^2/8-1)}\),
$$\begin{aligned} \frac{\Vert (I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _2^2}{n} + \lambda \Vert \hat{\beta }_{-G} - \beta _{-G}^0\Vert _1 \le \frac{4s\lambda ^2}{\phi _0^2}. \end{aligned}$$
Theorem 1 allows us to show that if, in addition to (A1), the columns of \(\mathbf {X}_G\) and those of \(\mathbf {X}_{-G}\) satisfy a strong lack-of-correlation condition, then \(\hat{\beta }_G\) can be used for asymptotically valid inference for \(\beta _G^0\). To formalise this latter condition, it is convenient to let \(\mathbf {\Theta }\) denote the \(|G| \times (p-|G|)\) matrix \((\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T \mathbf {X}_{-G}\).
Corollary 2
Consider an asymptotic framework in which \(s=s_n \ge 1\) and \(p=p_n \rightarrow \infty \) as \(n \rightarrow \infty \), but \(\sigma ^2 > 0\) and G are constant. Assume (A1) holds for sufficiently large n (with \(\phi _0\) not depending on n), and also that \(\Vert \mathbf {\Theta }\Vert _\infty = o(s^{-1} \log ^{-1/2} p)\). If we choose \(\lambda := A\sigma \sqrt{\frac{\log p}{n}}\) in the above procedure with constant \(A > 2\sqrt{2}\), then
$$\begin{aligned} \sigma ^{-1}(\mathbf {X}_G^T\mathbf {X}_G)^{1/2}(\hat{\beta }_G - \beta _G^0) \overset{d}{\rightarrow } N_{|G|}(0, I). \end{aligned}$$
Proof
We can write
$$\begin{aligned} n^{1/2}(\hat{\beta }_G - \beta _G^0) = n^{1/2}(\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T\epsilon - \varDelta , \end{aligned}$$
where \(\varDelta := n^{1/2}(\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T \mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0) = n^{1/2}\mathbf {\Theta }(\hat{\beta }_{-G} - \beta _{-G}^0)\). Now
$$\begin{aligned} n^{1/2}(\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T\epsilon \sim N_{|G|}\bigl (0,\sigma ^2 n(\mathbf {X}_G^T\mathbf {X}_G)^{-1}\bigr ). \end{aligned}$$
Also, from the proof of Theorem 1, on \(\varOmega _0 := \bigl \{\Vert \mathbf {X}_{-G}^T(I-P_G)\epsilon \Vert _\infty /n \le \lambda /2\bigr \}\),
$$\begin{aligned} \Vert \varDelta \Vert _\infty \le n^{1/2}\Vert \mathbf {\Theta }\Vert _\infty \Vert \hat{\beta }_{-G} - \beta _{-G}^0\Vert _1 \le \frac{4As\sigma \Vert \mathbf {\Theta }\Vert _\infty \log ^{1/2} p}{\phi _0^2} = o(1). \end{aligned}$$
Since \(\mathbb {P}(\varOmega _0) \rightarrow 1\), the conclusion follows.
We remark that for \(j \in G^c\), \(\mathbf {\Theta }_j\) is the vector of coefficients in the ordinary least squares regression of \(X_j\) on \(\mathbf {X}_G\). Even though the condition on \(\Vert \mathbf {\Theta }\Vert _\infty \) is strong, it may well be reasonable to suppose that, having pre-specified the index set G of variables that we are interested in, we should avoid including in our model other variables that are significantly correlated with \(\mathbf {X}_G\).
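Under the conditions of Corollary 2, confidence intervals for the coordinates of \(\beta _G^0\) follow immediately from the Gaussian limit. Here is a minimal sketch (function name ours), assuming \(\sigma \) known as in our model above; in practice it would of course need to be estimated.

```python
import numpy as np
from scipy.stats import norm

def ci_beta_G(XG, beta_G_hat, sigma, level=0.95):
    """Coordinate-wise confidence intervals for beta_G^0 based on Corollary 2,
    under which beta_hat_G is approximately N(beta_G^0, sigma^2 (X_G^T X_G)^{-1})."""
    cov = sigma ** 2 * np.linalg.inv(XG.T @ XG)  # approximate covariance of beta_hat_G
    z = norm.ppf((1 + level) / 2)                # e.g. 1.96 when level = 0.95
    half_width = z * np.sqrt(np.diag(cov))
    return np.column_stack((beta_G_hat - half_width, beta_G_hat + half_width))
```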
2 More complicated settings
Without this strong orthogonality condition, we might instead consider adjusting \(\hat{\beta }_G\) by de-biasing, or de-sparsifying, \(\hat{\beta }_{-G}\). Following van de Geer et al. (2014), we suggest replacing \(\hat{\beta }_{-G}\) by
$$\begin{aligned} \hat{b}_{-G} := \hat{\beta }_{-G} + \frac{1}{n}M\mathbf {X}_{-G}^T(I-P_G)(Y - \mathbf {X}_{-G}\hat{\beta }_{-G}) \end{aligned}$$
for some matrix \(M \in \mathbb {R}^{(p-|G|)\times (p-|G|)}\). This yields the de-biased estimator
$$\begin{aligned} \hat{b}_G := (\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T(Y - \mathbf {X}_{-G}\hat{b}_{-G}) = \beta _G^0 + (\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T\epsilon - \frac{1}{n}\mathbf {\Theta }M\mathbf {X}_{-G}^T(I-P_G)\epsilon - R(\hat{\beta }_{-G} - \beta _{-G}^0), \end{aligned}$$
where R is the \(|G| \times (p-|G|)\) matrix given by
$$\begin{aligned} R := \mathbf {\Theta }\Bigl (I - \frac{1}{n}M\mathbf {X}_{-G}^T(I-P_G)\mathbf {X}_{-G}\Bigr ). \end{aligned}$$
Under our Gaussian error assumption, \((\mathbf {X}_G^T\mathbf {X}_G)^{-1}\mathbf {X}_G^T \epsilon \) and \(n^{-1}\mathbf {\Theta } M \mathbf {X}_{-G}^T(I-P_G)\epsilon \) are independent centred Gaussian random vectors; thus if the remainder term \(R(\hat{\beta }_{-G}-\beta _{-G}^0)\) is of smaller order, we see that our estimator \(\hat{b}_G\) is approximately centred Gaussian. The techniques of van de Geer et al. (2014) or Javanmard and Montanari (2014) might then be used to give asymptotic justifications for Gaussian confidence sets and hypothesis tests concerning \(\beta _G^0\). But another very interesting direction would be to adapt the bootstrap approaches proposed in the current paper to the estimator \(\hat{b}_G\).
As in van de Geer et al. (2014), we should choose M depending on \(\mathbf {X}\) to control the remainder, via
$$\begin{aligned} \Vert R(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _\infty \le \Vert R\Vert _\infty \Vert \hat{\beta }_{-G} - \beta _{-G}^0\Vert _1. \end{aligned}$$
Note that we may write the matrix R in terms of the sample covariance matrix of the covariates \(\hat{\varSigma } := \mathbf {X}^T\mathbf {X}/n\) (using obvious notation for the partitioning) as
$$\begin{aligned} R = \hat{\varSigma }_{G,G}^{-1}\hat{\varSigma }_{G,-G}\bigl \{I - M(\hat{\varSigma }_{-G,-G} - \hat{\varSigma }_{-G,G}\hat{\varSigma }_{G,G}^{-1}\hat{\varSigma }_{G,-G})\bigr \}. \end{aligned}$$
Of course, if \(\hat{\varSigma }\) is invertible, then
$$\begin{aligned} (\hat{\varSigma }^{-1})_{-G,-G} = (\hat{\varSigma }_{-G,-G} - \hat{\varSigma }_{-G,G}\hat{\varSigma }_{G,G}^{-1}\hat{\varSigma }_{G,-G})^{-1}, \end{aligned}$$
so M can be thought of as an approximation to \((\hat{\varSigma }^{-1})_{-G,-G}\) (even though \(\hat{\varSigma }\) is not invertible when \(p > n\)). In general, we might use concentration inequalities for the entries of \(\hat{\varSigma }\) to control \(\Vert R\Vert _\infty \); if we think of |G| as small, then we only have O(p) entries to control, rather than the \(O(p^2)\) that is more typical in these de-biasing problems. We hope to pursue these ideas elsewhere.
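To illustrate the algebra above, the following sketch (function name ours) computes \(\hat{b}_G\) together with the remainder matrix R for a user-supplied correction matrix M; how best to construct M is precisely the question we leave open.

```python
import numpy as np

def debiased_b_G(X, Y, G, beta_mG_hat, M):
    """De-biased estimator b_hat_G and remainder matrix R for a given M,
    where M approximates the (-G,-G) block of the inverse sample covariance."""
    n, p = X.shape
    G = np.asarray(G)
    notG = np.setdiff1d(np.arange(p), G)
    XG, XmG = X[:, G], X[:, notG]
    Q, _ = np.linalg.qr(XG)
    resid = lambda v: v - Q @ (Q.T @ v)          # applies (I - P_G)
    # De-sparsified estimate of beta_{-G}
    b_mG = beta_mG_hat + M @ (XmG.T @ resid(Y - XmG @ beta_mG_hat)) / n
    # Plug back into the least-squares formula for the coordinates in G
    b_G = np.linalg.solve(XG.T @ XG, XG.T @ (Y - XmG @ b_mG))
    # Remainder R = Theta {I - M (Schur complement)}; its entrywise maximum
    # bounds the bias term R (beta_hat_{-G} - beta^0_{-G}) via the l_1 error
    Theta = np.linalg.solve(XG.T @ XG, XG.T @ XmG)
    S = XmG.T @ resid(XmG) / n
    R = Theta @ (np.eye(len(notG)) - M @ S)
    return b_G, R
```

Since |G| is small, monitoring \(\max _{j,k}|R_{jk}|\) here involves only O(p) entries, in line with the remark above.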
References
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37:1705–1732
Chatterjee A, Lahiri SN (2011) Bootstrapping Lasso estimators. J Am Stat Assoc 106:608–625
Javanmard A, Montanari A (2014) Confidence intervals and hypothesis testing for high-dimensional regression. J Mach Learn Res 15:2869–2909
Samworth R (2003) A note on methods of restoring consistency to the bootstrap. Biometrika 90:985–990
van de Geer S, Bühlmann P, Ritov Y, Dezeure R (2014) On asymptotically optimal confidence regions and tests for high-dimensional models. Ann Stat 42:1166–1202
Acknowledgements
The first author thanks St John’s College, Cambridge and the Statistical Laboratory at the University of Cambridge for their kind hospitality during the period in which this research was carried out.
This comment refers to the invited paper available at: doi:10.1007/s11749-017-0554-2.
The first author is supported by a grant from the Natural Sciences and Engineering Research Council of Canada. The second author is supported by an Engineering and Physical Sciences Research Council Fellowship (Grant No. EP/J017213/1) and a grant from the Leverhulme Trust (Grant No. RG81761).
Appendix
Proof of Theorem 1
The KKT conditions for the problem (1) state that
$$\begin{aligned} \frac{1}{n}\mathbf {X}_{-G}^T(I-P_G)(Y - \mathbf {X}_{-G}\hat{\beta }_{-G}) = \lambda \gamma , \end{aligned}$$
where \(\Vert \gamma \Vert _\infty \le 1\) and \(\gamma _j = \mathrm {sgn}(\hat{\beta }_{-G,j})\) if \(\hat{\beta }_{-G,j} \ne 0\). Thus
$$\begin{aligned} \frac{1}{n}\mathbf {X}_{-G}^T(I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0) = \frac{1}{n}\mathbf {X}_{-G}^T(I-P_G)\epsilon - \lambda \gamma . \end{aligned}$$
Let \(\varOmega _0 := \bigl \{\Vert \mathbf {X}_{-G}^T(I-P_G)\epsilon \Vert _\infty /n \le \lambda /2\bigr \}\). Then since \(\mathbf {X}_{-G}^T(I-P_G)\epsilon \sim N_{p-|G|}(0,\sigma ^2\mathbf {X}_{-G}^T(I-P_G)\mathbf {X}_{-G})\), and since the diagonal entries of \(\mathbf {X}_{-G}^T(I-P_G)\mathbf {X}_{-G}\) are bounded above by n, we have \(\mathbb {P}(\varOmega _0^c) \le p^{-(A^2/8-1)}\). Moreover, on \(\varOmega _0\), multiplying the previous display on the left by \((\hat{\beta }_{-G} - \beta _{-G}^0)^T\) and using the facts that \(\gamma ^T\hat{\beta }_{-G} = \Vert \hat{\beta }_{-G}\Vert _1\) and \(\gamma ^T\beta _{-G}^0 \le \Vert \beta _{-G}^0\Vert _1\),
$$\begin{aligned} \frac{\Vert (I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _2^2}{n} + \frac{\lambda }{2}\Vert \hat{\beta }_{-G,N}\Vert _1 \le \frac{3\lambda }{2}\Vert \hat{\beta }_{-G,S} - \beta _{-G,S}^0\Vert _1. \end{aligned}$$
In particular, \(\Vert \hat{\beta }_{-G,N} - \beta _{-G,N}^0\Vert _1 = \Vert \hat{\beta }_{-G,N}\Vert _1 \le 3\Vert \hat{\beta }_{-G,S} - \beta _{-G,S}^0\Vert _1\), so from (A1),
$$\begin{aligned} \Vert \hat{\beta }_{-G,S} - \beta _{-G,S}^0\Vert _1 \le \frac{s^{1/2}\Vert (I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _2}{n^{1/2}\phi _0}. \end{aligned}$$
Thus
$$\begin{aligned} \frac{\Vert (I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _2^2}{n} + \frac{\lambda }{2}\Vert \hat{\beta }_{-G} - \beta _{-G}^0\Vert _1&\le 2\lambda \Vert \hat{\beta }_{-G,S} - \beta _{-G,S}^0\Vert _1 \\&\le \frac{2\lambda s^{1/2}}{\phi _0}\frac{\Vert (I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _2}{n^{1/2}} \\&\le \frac{\Vert (I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _2^2}{2n} + \frac{2s\lambda ^2}{\phi _0^2}. \end{aligned}$$
We conclude that
$$\begin{aligned} \frac{\Vert (I-P_G)\mathbf {X}_{-G}(\hat{\beta }_{-G} - \beta _{-G}^0)\Vert _2^2}{n} + \lambda \Vert \hat{\beta }_{-G} - \beta _{-G}^0\Vert _1 \le \frac{4s\lambda ^2}{\phi _0^2}, \end{aligned}$$
as required.