
Screen then select: a strategy for correlated predictors in high-dimensional quantile regression

  • Original Paper
  • Published in: Statistics and Computing

Abstract

Strong correlation among predictors and heavy-tailed noise pose a great challenge in the analysis of ultra-high-dimensional data: they increase the computation time needed to discover active variables and decrease selection accuracy. To address this issue, we propose an innovative two-stage screen-then-select approach, together with a derivative procedure, based on robust quantile regression under a sparsity assumption. The approach first screens important features by ranking quantile ridge estimates and then employs a likelihood-based post-screening selection strategy to refine the variable selection. Additionally, we introduce an internal competition mechanism along the greedy search path to enhance the robustness of the algorithm against design dependence. Our methods are simple to implement and possess numerous desirable properties from theoretical and computational standpoints. Theoretically, we establish the strong consistency of feature selection for the proposed methods under some regularity conditions. In empirical studies, we assess the finite-sample performance of our methods by comparing them with utility screening approaches and existing penalized quantile regression methods. Furthermore, we apply our methods to identify genes associated with anticancer drug sensitivities for practical guidance.



Data availability

The publicly available Cancer Cell Line Encyclopedia (CCLE) dataset is obtained from https://sites.broadinstitute.org/ccle.


Acknowledgements

This research was supported by NSFC grant 12271238, Guangdong NSF grant 2023A1515010025, and Shenzhen Sci-Tech Fund (JCYJ20210324104803010), awarded to Xuejun Jiang.

Author information

Contributions

Conceptualization, XJ and HF; methodology, YK, HF and XJ; software, YK; resources, XJ; data curation, YK; writing-original draft preparation, YK and HF; supervision, XJ; funding acquisition, XJ.

Corresponding author

Correspondence to Haofeng Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information


Xuejun Jiang, Yakun Liang and Haofeng Wang have equally contributed to this work.

Appendices

Appendix A. Useful lemmas

We first introduce the notation \(\varvec{\beta }^*\) for the true regression coefficient vector, which is assumed to be sparse with only a small proportion of nonzero entries. Lemma 1 is used to prove Proposition 1 and Theorem 1. Lemmas 2 and 3 are useful for proving Theorems 2 and 3.

Lemma 1

Suppose assumptions A1 and A2 hold. If the dimension \(p_n\) satisfies \(\log (p_n) = o({n^{1-5\omega -2\kappa -v}}/{\log (n)})\), then there exist some constants c, \(\tilde{c}\), \(c'_{1}\), and \(c'_{2}\) such that

  1. (a)

    For any fixed vector \(\varvec{t}\) with \(\Vert \varvec{t}\Vert _{2}=1\),

    $$\begin{aligned}&P\left( \varvec{t}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{t}<c_{1}'n^{1-\omega }p_n^{-1} \ \text {or} \ \ \varvec{t}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{t}>c_{2}'n^{1+\omega }p_n^{-1}\right) \\&\quad \le 4\exp (-C_1n); \end{aligned}$$
  2. (b)

    \(P\left( \Vert (\mathbb {X}\mathbb {X}^\top )^{-1}\mathbb {X}\varvec{e}_{i}\Vert _{2}^2>c_{1}c_{2}'n^{1+2\omega }p_n^{-2}\right) {<}3\exp (-C_{1}n);\)

  3. (c)

    \(P\left( \min _{i\in \mathcal {S}^*}\left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| <\dfrac{cn^{1-\omega -\kappa }}{p_{n}}\right) =\) \(O\left\{ \exp \left( \dfrac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} ;\)

  4. (d)

    \(P\left( \max _{i\notin \mathcal {S}^{*}}\left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| >\dfrac{\tilde{c} n^{1-\omega -\kappa }}{p_{n}\sqrt{\log n}} \right) =\) \( O\left\{ \exp \left( \dfrac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} ;\)

  5. (e)

    \(P\left( \lambda _{\max }(\mathbb {X}\mathbb {X}^\top )\ge c_{1}c_{4}c_{5} p_{n}n^{\omega } \ \text {or} \ \lambda _{\min }(\mathbb {X}\mathbb {X}^\top )\le c_{1}^{-1}c_{4}^{-1}c_{5}p_nn^{-\omega } \right) \le 2\exp (-C_{1}n);\)

where \(\varvec{e}_i=(0,\ldots ,1,0,\ldots ,0)^\top \) denotes the i-th natural basis vector in the \(p_n\)-dimensional Euclidean space, \(\omega ,\kappa ,v\) are parameters defined in assumption A2, \(C_1\) is defined in assumption A1, and \(\mathbb {P}_{\mathbb {X}^{\top }} = \mathbb {X}^{\top }(\mathbb {X}\mathbb {X}^{\top })^{-1}\mathbb {X}\) is the projection matrix onto the row space of \(\mathbb {X}\).
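As a quick numerical illustration (a minimal numpy sketch with toy dimensions, not the authors' code), the matrix \(\mathbb {P}_{\mathbb {X}^{\top }}\) can be formed directly and checked to be a rank-\(n\) orthogonal projection; its coordinates \(\varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\) are the screening quantities bounded in parts (c) and (d):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                        # high-dimensional regime: p_n >> n
X = rng.standard_normal((n, p))       # rows are the n observations

# Projection onto the row space of X: P = X^T (X X^T)^{-1} X
P = X.T @ np.linalg.solve(X @ X.T, X)

assert np.allclose(P, P.T)            # symmetric
assert np.allclose(P @ P, P)          # idempotent
assert round(np.trace(P)) == n        # trace of a projection equals its rank

# The screened quantity in Lemma 1(c)-(d): coordinates of P beta*
beta_star = np.zeros(p)
beta_star[:5] = 1.0                   # a sparse "true" coefficient vector
xi = P @ beta_star                    # e_i^T P beta* is its i-th entry
```

Since \(n<p_n\), \(\mathbb {X}\mathbb {X}^{\top }\) is invertible almost surely and the projection has rank exactly n.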

Proof of Lemma 1

Parts (a) and (b) follow from Lemma 4 and formula (22) in the Supplementary Material of Wang and Leng (2016), respectively.

For (c) and (d): by Lemma 5 in the Supplementary Material of Wang and Leng (2016), there exist some \(c,\tilde{c}>0\) such that for \(i\in \mathcal {S}^*\),

$$\begin{aligned}{} & {} P\left( \left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| <\dfrac{cn^{1-\omega -\kappa }}{p_{n}}\right) \\{} & {} \quad = O\left\{ \exp \left( \dfrac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} , \end{aligned}$$

and for \(i\notin \mathcal {S}^*\),

$$\begin{aligned}{} & {} P\left( \left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| >\frac{\tilde{c} n^{1-\omega -\kappa }}{p_{n}\sqrt{\log n}} \right) \\{} & {} \quad =O\left\{ \exp \left( \dfrac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} . \end{aligned}$$

Applying assumption A2 with Bonferroni’s inequality, we have

$$\begin{aligned}{} & {} P\left( \min _{i\in \mathcal {S}^*}\left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| <\dfrac{cn^{1-\omega -\kappa }}{p_{n}}\right) \\{} & {} \quad \le O\left\{ s_{n}\exp \left( \frac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} \\{} & {} \quad =O\left\{ \exp \left( \frac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} \end{aligned}$$

and

$$\begin{aligned}{} & {} P\left( \max _{i\notin \mathcal {S}^*}\left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right|>\frac{\tilde{c} n^{1-\omega -\kappa }}{p_{n}\sqrt{\log n}} \right) \\{} & {} \quad \le \sum _{i\notin \mathcal {S}^*}P\left( \left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| >\frac{\tilde{c} n^{1-\omega -\kappa }}{p_{n}\sqrt{\log n}} \right) \\{} & {} \quad =O\left\{ \exp \left( \frac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} . \end{aligned}$$

For (e), by assumption A2, we have that

$$\begin{aligned}{} & {} \lambda _{\max }\left( \mathbb {X}\mathbb {X}^\top \right) =\lambda _{\max }\left( \mathbb {W}\varvec{\Sigma }\mathbb {W}^\top \right) \\{} & {} \quad \le \lambda _{\max }\left( \varvec{\Sigma }\right) \lambda _{\max }\left( \mathbb {W}\mathbb {W}^\top \right) \\{} & {} \quad \le c_{5}c_{4}n^{\omega } \lambda _{\max }\left( \mathbb {W}\mathbb {W}^\top \right) , \end{aligned}$$

and

$$\begin{aligned}{} & {} \lambda _{\min }\left( \mathbb {X}\mathbb {X}^\top \right) =\lambda _{\min }\left( \mathbb {W}\varvec{\Sigma }\mathbb {W}^\top \right) \\{} & {} \quad \ge \lambda _{\min }\left( \varvec{\Sigma }\right) \lambda _{\min }\left( \mathbb {W}\mathbb {W}^\top \right) \\{} & {} \quad \ge c_{5}c_{4}^{-1}n^{-\omega } \lambda _{\min }\left( \mathbb {W}\mathbb {W}^\top \right) . \end{aligned}$$

Combined with assumption of eigenvalues in A1, we have

$$\begin{aligned} \begin{aligned}&P\left( \lambda _{\max }\left( \mathbb {X}\mathbb {X}^\top \right) \ge c_{1}c_{5}c_{4}n^{\omega }p_n\right) \\&\le P\left( \lambda _{\max }\left( \mathbb {W}\mathbb {W}^\top \right) \ge c_{1}p_n\right) \le \exp (-C_{1}n),\\&P\left( \lambda _{\min }\left( \mathbb {X}\mathbb {X}^\top \right) \le c_{1}^{-1}c_{5}c_{4}^{-1}n^{-\omega }p_n\right) \\&\le P\left( \lambda _{\min }\left( \mathbb {W}\mathbb {W}^\top \right) \le c_{1}^{-1}p_n\right) \le \exp (-C_{1}n). \end{aligned} \end{aligned}$$

This proves the lemma. \(\square \)
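The two deterministic eigenvalue inequalities used for part (e), \(\lambda _{\max }(\mathbb {W}\varvec{\Sigma }\mathbb {W}^\top )\le \lambda _{\max }(\varvec{\Sigma })\lambda _{\max }(\mathbb {W}\mathbb {W}^\top )\) and \(\lambda _{\min }(\mathbb {W}\varvec{\Sigma }\mathbb {W}^\top )\ge \lambda _{\min }(\varvec{\Sigma })\lambda _{\min }(\mathbb {W}\mathbb {W}^\top )\), can be checked numerically on a toy example (the particular \(\varvec{\Sigma }\) below is an arbitrary positive-definite matrix chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 15, 40
W = rng.standard_normal((n, p))

A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)        # a generic positive-definite covariance

XXt = W @ Sigma @ W.T              # plays the role of X X^T = W Sigma W^T
ev = np.linalg.eigvalsh            # ascending eigenvalues of a symmetric matrix

# lambda_max(W Sigma W^T) <= lambda_max(Sigma) * lambda_max(W W^T)
assert ev(XXt)[-1] <= ev(Sigma)[-1] * ev(W @ W.T)[-1] + 1e-8
# lambda_min(W Sigma W^T) >= lambda_min(Sigma) * lambda_min(W W^T)
assert ev(XXt)[0] >= ev(Sigma)[0] * ev(W @ W.T)[0] - 1e-8
```

Both inequalities follow from bounding the Rayleigh quotient \(\varvec{u}^\top \mathbb {W}\varvec{\Sigma }\mathbb {W}^\top \varvec{u}\) between \(\lambda _{\min }(\varvec{\Sigma })\Vert \mathbb {W}^\top \varvec{u}\Vert _2^2\) and \(\lambda _{\max }(\varvec{\Sigma })\Vert \mathbb {W}^\top \varvec{u}\Vert _2^2\).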

Lemma 2

For \(\mathcal {S}^*\not \subset \mathcal {S}\), denote \(\hat{\varvec{\beta }}_{\mathcal {S}} = \arg \min _{\varvec{\beta }_{\mathcal {S}}\in \mathbb {R}^{|\mathcal {S}|}}n^{-1}\sum _{i=1}^{n}\rho _{\tau }(Y_i-\varvec{X}^{\top }_{i,\mathcal {S}}\varvec{\beta }_{\mathcal {S}})\), and the pseudo-true coefficient \(\tilde{\varvec{\beta }}_{\mathcal {S}}= \arg \min _{\varvec{\beta }_{\mathcal {S}}\in \mathbb {R}^{|\mathcal {S}|}}E\left[ n^{-1}\sum _{i=1}^{n}\rho _{\tau }(Y_i-\varvec{X}^{\top }_{i,\mathcal {S}}\varvec{\beta }_{\mathcal {S}})\right] \) on the support of the model \(\mathcal {S}\). If assumptions A2-A4 hold, then

$$\begin{aligned} \sup _{|\mathcal {S}|\le d}\Vert \hat{\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}}\Vert _{2} = O_p\left( \sqrt{\dfrac{|\mathcal {S}|\log (n)\log (p_n)}{n}}\right) \end{aligned}$$

uniformly in \(\mathcal {S}\) as \(n\rightarrow \infty \) for \(|\mathcal {S}|\le d\) and \(d=O(n^{1/2})\).

Proof of Lemma 2

For a given deterministic \(\gamma >0\), we first define the set \(\mathcal {B}_{\gamma } = \left\{ \varvec{\beta }_{\mathcal {S}}\in \mathbb {R}^{|\mathcal {S}|}: \Vert {\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}}\Vert _2 \le \gamma \right\} \) and the function

$$\begin{aligned} D_{\gamma } = \sup _{\varvec{\beta }_{\mathcal {S}}\in \mathcal {B}_{\gamma }} \left| Q_n(\varvec{\beta }_{\mathcal {S}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}}) - E\left[ Q_n(\varvec{\beta }_{\mathcal {S}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}})\right] \right| . \end{aligned}$$

We apply Knight’s identity:

$$\begin{aligned}&{} \rho _{\tau }(u-v)-\rho _{\tau }(u) = -v\psi _{\tau }(u)\nonumber \\{}&{} +\int _{0}^{v}\{I(u\le s)-I(u\le 0)\} \text {d}s,\end{aligned}$$
(9)

where \(\psi _{\tau }(h) = \tau - I(h<0)\). Let \(u_i = Y_i - \varvec{X}^{\top }_{i,\mathcal {S}}\tilde{\varvec{\beta }}_{\mathcal {S}}\) and \(v_i=\varvec{X}^{\top }_{i,\mathcal {S}}({\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}})\), then

$$\begin{aligned} \begin{aligned}&n\left[ Q_n(\varvec{\beta }_{\mathcal {S}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}})\right] \\&\quad =\sum _{i=1}^{n} \varvec{X}^{\top }_{i,\mathcal {S}}(\tilde{\varvec{\beta }}_{\mathcal {S}}-{\varvec{\beta }}_{\mathcal {S}})\psi _{\tau }(Y_i - \varvec{X}^{\top }_{i,\mathcal {S}}\tilde{\varvec{\beta }}_{\mathcal {S}}) \\&\qquad + \sum _{i=1}^n \int _{0}^{\varvec{X}^{\top }_{i,\mathcal {S}}({\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}})}\\&\qquad \{I(Y_i - \varvec{X}^{\top }_{i,\mathcal {S}}\tilde{\varvec{\beta }}_{\mathcal {S}}\le s)-I(Y_i - \varvec{X}^{\top }_{i,\mathcal {S}}\tilde{\varvec{\beta }}_{\mathcal {S}}\le 0)\} \textrm{d}s\\&\quad =I_{a1}+I_{a2}. \end{aligned} \end{aligned}$$

For \(I_{a1}\), \(E(I_{a1}|\varvec{X})=0\) by the first-order condition. For \(I_{a2}\), by Fubini’s theorem, the mean value theorem, and assumptions A2 and A4,

$$\begin{aligned} \begin{aligned} E(I_{a2}|\varvec{X})&= \sum _{i=1}^n \int _{0}^{\varvec{X}^{\top }_{i,\mathcal {S}}({\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}})}\\&\qquad \{P(Y_i - \varvec{X}^{\top }_{i,\mathcal {S}}\tilde{\varvec{\beta }}_{\mathcal {S}}\le s)-P(Y_i - \varvec{X}^{\top }_{i,\mathcal {S}}\tilde{\varvec{\beta }}_{\mathcal {S}}\le 0)\} \textrm{d}s\\&\ge \sum _{i=1}^n \frac{1}{2}\underline{f}\left[ \varvec{X}^{\top }_{i,\mathcal {S}}(\tilde{\varvec{\beta }}_{\mathcal {S}}-{\varvec{\beta }}_{\mathcal {S}})\right] ^2\\&=\frac{1}{2}\underline{f} ( {\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}} )^{\top }\sum _{i=1}^{n}(\varvec{X}_{i,\mathcal {S}}\varvec{X}_{i,\mathcal {S}}^{\top }) ( {\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}})\\&\ge C_f n\Vert {\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}} \Vert _{2}^2. \end{aligned} \end{aligned}$$

for some positive constant \(C_f\). This implies that

$$\begin{aligned} E[Q_n(\varvec{\beta }_{\mathcal {S}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}})]\ge C_f\Vert {\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}} \Vert _{2}^2. \end{aligned}$$
(10)

If we define a convex combination \({\varvec{\theta }}_{\mathcal {S}} = a\hat{\varvec{\beta }}_{\mathcal {S}} + (1-a)\tilde{\varvec{\beta }}_\mathcal {S}\) with \(a=\gamma /(\gamma + \Vert \hat{\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2})\), by definition of \({\varvec{\theta }}_{\mathcal {S}}\), \(\Vert {\varvec{\theta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2} = a\Vert \hat{\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2}\le \gamma \), which falls in the set \(\mathcal {B}_{\gamma }\). Then by the convexity and the definition of \(\hat{\varvec{\beta }}_{\mathcal {S}}\),

$$\begin{aligned} Q_{n} ({\varvec{\theta }}_{\mathcal {S}})\le aQ_n(\hat{\varvec{\beta }}_{\mathcal {S}}) +(1-a)Q_n(\tilde{\varvec{\beta }}_\mathcal {S})\le Q_n(\tilde{\varvec{\beta }}_\mathcal {S}). \end{aligned}$$

Using this and the triangle inequality, we have

$$\begin{aligned} E\left[ Q_{n} ({\varvec{\theta }}_{\mathcal {S}}) - Q_n(\tilde{\varvec{\beta }}_\mathcal {S})\right]= & {} \left\{ Q_n(\tilde{\varvec{\beta }}_\mathcal {S}) - E[Q_n(\tilde{\varvec{\beta }}_\mathcal {S})]\right\} \nonumber \\{} & {} - \left\{ Q_n({\varvec{\theta }}_{\mathcal {S}}) - E[Q_n({\varvec{\theta }}_{\mathcal {S}})]\right\} \nonumber \\{} & {} + Q_n({\varvec{\theta }}_{\mathcal {S}}) - Q_n(\tilde{\varvec{\beta }}_\mathcal {S})\nonumber \\\le & {} D_{\gamma }. \end{aligned}$$
(11)

Note that \(\Vert {\varvec{\theta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2}\le \gamma /2\) implies \(\Vert \hat{\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2}\le \gamma \). Denote \(\gamma _n = \sqrt{|\mathcal {S}|\log (n)\log (p_n)/n}\) and let \(C_\gamma \) be a positive constant. Combining (10) with (11), we have

$$\begin{aligned}{} & {} P\left( \Vert \hat{\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}} \Vert _{2} \ge C_\gamma \gamma _n \right) \le P\left( \Vert {\varvec{\theta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}} \Vert _{2} \ge C_\gamma \gamma _n/2 \right) \\{} & {} \quad \le P\left( D_{\gamma } \ge C_fC_\gamma ^2 \gamma _n^2 /4 \right) . \end{aligned}$$

Similar to the argument of Lemma 1 of Fan et al. (2014), we have \(E(D_{\gamma })\le 4\gamma \sqrt{|\mathcal {S}|/n}\) after employing the standard symmetrization and contraction theorems (see Section 14.7 of Bühlmann and Van De Geer 2011). Applying Massart’s concentration theorem (see Section 14.6 of Bühlmann and Van De Geer 2011) yields that for any \(t>0\), \(P([D_\gamma - E(D_\gamma )]/V_n \ge t)\le \exp (-nt^2/8)\), where \(V_n=2C_x\sqrt{|\mathcal {S}|}\gamma \) and \(C_x\) is a constant greater than \(\max _{i,j}|X_{ij}|\). It follows that

$$\begin{aligned} P\left( D_\gamma \ge 4\gamma \sqrt{|\mathcal {S}|/n}(1+t) \right) \le \exp (-2t^2). \end{aligned}$$
(12)

Letting \(\gamma = 16C_f^{-1}n^{-1/2}(1+t)\) and \(1+t = C_\gamma C_f \sqrt{|\mathcal {S}|\log (n)\log (p_n)}/16\), it follows from (12) that

$$\begin{aligned}{} & {} P\left( \Vert \hat{\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}} \Vert _{2}\right. \\{} & {} \left. \ge C_\gamma \gamma _n\right) \le P\left( D_{\gamma } \ge \dfrac{C_{\gamma }^2C_f}{4} \dfrac{|\mathcal {S}|\log (n)\log (p_n)}{n}\right) \\{} & {} \quad \le \exp \left( -C_{\gamma }'|\mathcal {S}|\log (n)\log (p_n)\right) . \end{aligned}$$

for some positive constant \(C_{\gamma }'\). Moreover, using Boole’s inequality, we have

$$\begin{aligned} \begin{aligned} P\left( \sup _{|\mathcal {S}|\le d} \Vert \hat{\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}}\Vert _{2} \ge C_{\gamma }\gamma _n\right)&\le \sum _{|\mathcal {S}|\le d} P\left( \Vert \hat{\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_{\mathcal {S}} \Vert _{2} \ge C_{\gamma }\gamma _n\right) \\&\le \sum _{d_0=1}^{d} \left( {\begin{array}{c}p_n\\ d_0\end{array}}\right) p_n^{-C_{\gamma }' d_0 \log (n)}\\&\le \sum _{d_0=1}^{d}\left( \frac{p_n e}{d_0}\right) ^{d_0} p_n^{-C_{\gamma }' d_0 \log (n)}\rightarrow 0. \end{aligned} \end{aligned}$$

Thus, the proof is complete. \(\square \)
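Knight's identity (9), which drives the decomposition of \(Q_n\) in this proof, is elementary and can be verified numerically. The closed form of the integral term below is derived for this illustration and is not taken from the paper:

```python
import numpy as np

def rho(h, tau):
    """Quantile check loss rho_tau(h) = h * (tau - I(h < 0))."""
    return h * (tau - (h < 0))

def psi(h, tau):
    """psi_tau(h) = tau - I(h < 0)."""
    return tau - (h < 0)

def knight_integral(u, v):
    """Closed form of  int_0^v [I(u <= s) - I(u <= 0)] ds."""
    G = np.where(v >= 0,
                 v - np.clip(u, 0.0, np.maximum(v, 0.0)),  # v >= 0 branch
                 np.minimum(0.0, np.maximum(v, u)))        # v < 0 branch
    return G - v * (u <= 0)

rng = np.random.default_rng(2)
u = rng.standard_normal(1000)
v = rng.standard_normal(1000)
for tau in (0.1, 0.5, 0.9):
    lhs = rho(u - v, tau) - rho(u, tau)
    rhs = -v * psi(u, tau) + knight_integral(u, v)
    assert np.allclose(lhs, rhs)
```

The identity isolates a linear term with conditional mean zero (the \(I_{a1}\) part) from a nonnegative integral term (the \(I_{a2}\) part), which is what makes the quadratic lower bound possible.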

Lemma 3

Suppose that assumptions A2-A4 hold. For \(\mathcal {S}^*\subset \mathcal {S}\), denote by \(\varvec{\beta }^*_{\mathcal {S}}\) the true coefficient vector \(\varvec{\beta }^*\) restricted to the support \(\mathcal {S}\). Then we have

$$\begin{aligned} \sup _{|\mathcal {S}|\le d}\Vert \hat{\varvec{\beta }}_{\mathcal {S}}-{\varvec{\beta }}_{\mathcal {S}}^*\Vert _{2} = O_p\left( \sqrt{\dfrac{|\mathcal {S}|\log (n)\log (p_n)}{n}}\right) \end{aligned}$$

uniformly in \(\mathcal {S}\) as \(n\rightarrow \infty \) for \(|\mathcal {S}|\le d\) and \(d=O(n^{1/2})\).

Proof of Lemma 3

Define \(h_{\mathcal {S}}(\Delta \varvec{\beta }_{\mathcal {S}})=\sum _{i=1}^{n}\{\rho _{\tau }(\epsilon _{i}-\varvec{X}_{i,\mathcal {S}}^{\top }\Delta \varvec{\beta }_{\mathcal {S}}) - \rho _{\tau }(\epsilon _{i})\}\) with \(\Delta \varvec{\beta }_{\mathcal {S}}=\varvec{\beta }_{\mathcal {S}}-\varvec{\beta }_{\mathcal {S}}^*\). By the convexity of \(h_{\mathcal {S}}\), it suffices to show that for any given \(D>0\), there exists a large constant \(L_D >0\) such that

$$\begin{aligned} \liminf _n P\left( \inf _{\mathcal {S}: |\mathcal {S}| \le d} \inf _{\varvec{\beta }_{\mathcal {S}}:\Vert \Delta \varvec{\beta }_{\mathcal {S}} \Vert =L_D\gamma _{n}} h_{\mathcal {S}}\left( \Delta \varvec{\beta }_{\mathcal {S}}\right)>0\right) >1-D, \end{aligned}$$
(13)

where \(\gamma _n = \sqrt{|\mathcal {S}|\log (n)\log (p_n)/n}\).

By Lemma A.1 in the Supplementary Material of Lee et al. (2014), for any sequence \(\{D_n\}\) satisfying \(1\le D_n \le d^{\delta _0/10}\) for some \(\delta _0>0\) with \(d^{2+\delta _0}=o(n)\), we have

$$\begin{aligned}{} & {} \sup _{|\mathcal {S}| \le d} \sup _{\left\| \Delta \varvec{\beta }_{\mathcal {S}}\right\| \le D_n \sqrt{|\mathcal {S}| / n} }\\{} & {} \left| |\mathcal {S}|^{-1} \left[ h_{\mathcal {S}}(\Delta \varvec{\beta }_{\mathcal {S}})+A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) +B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \right] \right| =o_p(1), \end{aligned}$$

in which

$$\begin{aligned} \begin{aligned} A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right)&= \sum _{i=1}^n-\varvec{X}_{i,\mathcal {S}}^{\top } \Delta \varvec{\beta }_{\mathcal {S}}\left( \tau -I\left( \epsilon _{i}<0\right) \right) ,\\ B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right)&= \sum _{i=1}^{n}E\left( \rho _\tau \left( \epsilon _{i}-\varvec{X}_{i,\mathcal {S}}^{\top } \Delta \varvec{\beta }_{\mathcal {S}}\right) -\rho _\tau \left( \epsilon _{i}\right) \right) . \end{aligned} \end{aligned}$$

Here, we take \(D_n=\sqrt{\log (n)\log (p_n)}\). Thus, \(h_{\mathcal {S}}\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \) can be decomposed as

$$\begin{aligned} h_{\mathcal {S}}\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) =A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) +B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) +|\mathcal {S}| o_p(1) \end{aligned}$$
(14)

for any \(\mathcal {S}^*\subset \mathcal {S}\), \(|\mathcal {S}|\le d\) and \(\Vert \Delta \varvec{\beta }_{\mathcal {S}}\Vert = L_D\gamma _{n}\).

For \(A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \),

$$\begin{aligned} \left| A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \right| \le \max _{1 \le j \le p}\left| \sum _{i=1}^n X_{ij}\left( \tau - I\left( \epsilon _{i}<0\right) \right) \right| |\mathcal {S}|^{1 / 2} L_D\gamma _{n}. \end{aligned}$$
(15)

By applying Example 14.3 of Bühlmann and Van De Geer (2011), we have

$$\begin{aligned} \max _{1 \le j \le p}\left| \sum _{i=1}^n X_{ij}\left( \tau -I\left( \epsilon _{i}<0\right) \right) \right| =O_p\left( \sqrt{n \log (n)\log (p_n)}\right) . \end{aligned}$$

For \(B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \), using Knight’s identity (9), Taylor’s theorem, and assumptions A2 and A4, we have

$$\begin{aligned} \begin{aligned} B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right)&=\sum _{i=1}^n E\left\{ \int _0^{\varvec{X}_{i,\mathcal {S}}^{\top } \Delta \varvec{\beta }_{\mathcal {S}}} \left( I\left( \epsilon _{i}\le s\right) -I\left( \epsilon _{i}\le 0\right) \right) \textrm{d}s\right\} \\&=\sum _{i=1}^n \int _0^{\varvec{X}_{i,\mathcal {S}}^{\top } \Delta \varvec{\beta }_{\mathcal {S}}} \left\{ P(\epsilon _{i}\le s)-P(\epsilon _{i}\le 0)\right\} \textrm{d}s \\&\ge \dfrac{1}{2}\underline{f} \sum _{i=1}^n\left( \varvec{X}_{i,\mathcal {S}}^{\top } \Delta \varvec{\beta }_{\mathcal {S}}\right) ^2 \\&\ge \dfrac{1}{2}\underline{f} n L_D^2 \gamma _{n}^2. \end{aligned} \end{aligned}$$
(16)

Combining (15) and (16), we have \(\left| A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \right| \le M_1 L_D n\gamma _{n}^2\) and \(B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \ge M_2 L_D^2n\gamma _{n}^2\) for some constants \(M_1\) and \(M_2\). Therefore, (13) holds because \(B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \) dominates all the other terms in (14) for sufficiently large \(L_D>M_1/M_2\), which does not depend on the choice of \(\mathcal {S}\). This completes the proof. \(\square \)
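The quadratic lower bound on \(B_n\) in (16) rests on the pointwise inequality \(\int _0^v\{F(s)-F(0)\}\,\textrm{d}s \ge \tfrac{1}{2}\underline{f}\,v^2\), where \(\underline{f}\) is a lower bound on the error density over the integration range. A small sketch checks this; the standard normal error distribution is assumed purely for illustration:

```python
import math

Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # N(0,1) CDF
phi = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)  # density

def B(v, m=10_000):
    """Midpoint-rule approximation of  int_0^v [Phi(s) - Phi(0)] ds  (v > 0)."""
    h = v / m
    return sum((Phi((i + 0.5) * h) - 0.5) * h for i in range(m))

for v in (0.1, 0.5, 1.0, 2.0):
    f_min = phi(v)                  # the density is decreasing on [0, v]
    assert B(v) >= 0.5 * f_min * v * v
```

The inequality follows from the mean value theorem: \(F(s)-F(0)\ge s\underline{f}\) for \(s\in [0,v]\), and integrating gives the \(\tfrac{1}{2}\underline{f}v^2\) lower bound.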

Appendix B. Proof of Proposition 1

We divide the proof into three parts:

(i) to show that the estimates of QRR \(\hat{\varvec{\beta }}\in \mathcal {C}\left( \mathbb {X}^\top \right) \), where \(\mathcal {C}\left( \mathbb {X}^\top \right) \) represents the column space spanned by \((\varvec{X}_{1},\ldots ,\varvec{X}_{n})^{\top }\);

(ii) to show that for \(n\rightarrow \infty \),

$$\begin{aligned}{} & {} P\left( \Vert \mathbb {X}(\hat{\varvec{\beta }}-\varvec{\xi })\Vert _{\infty }=O\left( n^{-2\omega -\kappa }/\sqrt{\log (n)}\right) ,\right. \nonumber \\{} & {} \left. \Vert \hat{\varvec{\beta }}-\varvec{\xi }\Vert _{2}=O\left( n^{1/2-3\omega /2-\kappa }/\sqrt{p_n\log (n)}\right) \right) \rightarrow 1, \end{aligned}$$
(17)

where \(\varvec{\xi }=\mathbb {P}_{\mathbb {X}^\top }{\varvec{\beta }}^*\), \(\mathbb {P}_{\mathbb {X}^{\top }} = \mathbb {X}^{\top }(\mathbb {X}\mathbb {X}^{\top })^{-1}\mathbb {X}\);

(iii) to show that \(\hat{\varvec{\beta }}=\mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top }+\lambda \mathbb {I}_n)^{-1} \mathbb {X}\varvec{\beta }^*+\varvec{R}_n\).

Part (i). Since \(\hat{\varvec{\beta }}=\arg \min _{\varvec{\beta }}\{Q_{n}(\varvec{\beta })+\lambda \Vert \varvec{\beta }\Vert _{2}^2\}\), where \(Q_{n}(\varvec{\beta })=n^{-1}\sum _{i=1}^n\rho _\tau (Y_i-\varvec{X}_i^\top \varvec{\beta })\), the first-order condition reads

$$\begin{aligned} 2\lambda \hat{\varvec{\beta }}-n^{-1}\sum _{i=1}^n\varvec{X}_{i}\left( \tau -I(Y_i-\varvec{X}_{i}^\top \hat{\varvec{\beta }}<0)\right) =0. \end{aligned}$$

Equivalently, \(\hat{\varvec{\beta }}=(2\lambda n)^{-1}\mathbb {X}^\top \varvec{v}\), where \(\varvec{v}=\left( v_1,\ldots ,v_n\right) ^{\top }\), \(v_{i}=\tau -I(Y_i-\varvec{X}_{i}^\top \hat{\varvec{\beta }}<0)\) for \(i=1,\ldots ,n\), and \(I(\cdot )\) is the indicator function. Hence \(\hat{\varvec{\beta }}\in \mathcal {C}\left( \mathbb {X}^\top \right) \).

Part (ii). Define the set

$$\begin{aligned}{} & {} \mathcal {B}_{\alpha }=\left\{ \varvec{\beta }\in \mathbb {R}^{p_{n}}: \Vert \mathbb {X}(\varvec{\beta }-\varvec{\xi })\Vert _{\infty }\right. \\{} & {} \quad \left. = \alpha M_{1}, \Vert \varvec{\beta }-\varvec{\xi }\Vert _{2}=\alpha M_{2}, \varvec{\beta }\in \mathcal {C}(\mathbb {X}^\top ) \right\} , \end{aligned}$$

for some sufficiently large constant \(\alpha \), where \(M_{1}= n^{-2\omega -\kappa }/\sqrt{\log (n)}\), \(M_{2}=n^{1/2-3\omega /2-\kappa }/\sqrt{p_n\log (n)}\), and \(\varvec{\xi }=\mathbb {P}_{\mathbb {X}^\top }{\varvec{\beta }^*}\). Let \(\mathcal {L}(\varvec{\beta }) = Q_{n}(\varvec{\beta })+\lambda \Vert \varvec{\beta }\Vert _{2}^2\). Notice the following decomposition:

$$\begin{aligned} \mathcal {L}(\varvec{\beta })-\mathcal {L}(\varvec{\xi })&=E\left[ Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi })\right] +Q_{n}(\varvec{\beta })\\&-Q_{n}(\varvec{\xi })-E\left[ Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi })\right] \\&+ \lambda \Vert \varvec{\beta }-\varvec{\xi }\Vert _{2}^2+2\lambda \varvec{\xi }^\top (\varvec{\beta }-\varvec{\xi })\\&\ge I_{n,1}+I_{n,2}+I_{n,3}, \end{aligned}$$

where \(I_{n,1}= E\left[ Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi })\right] \), \(I_{n,2} = Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi }) - E\left[ Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi })\right] \), and \(I_{n,3}=2\lambda \varvec{\xi }^\top (\varvec{\beta } -\varvec{\xi })\).

Note that

$$\begin{aligned} \varvec{X}_{i}^\top \varvec{\xi }=\varvec{1}_{i}^\top \mathbb {X}\mathbb {X}^\top (\mathbb {X}\mathbb {X}^\top )^{-1}\mathbb {X}\varvec{\beta }^*=\varvec{X}_{i}^\top \varvec{\beta }^*, \end{aligned}$$
(18)

where \(\varvec{1}_{i}=(0,\ldots ,1,0,\ldots ,0)^\top \) denotes the i-th natural basis vector in the n-dimensional Euclidean space. We set \(a_{i}=\varvec{X}_{i}^\top (\varvec{\beta }-\varvec{\xi })\) for \(i=1,\ldots ,n\). Then it is easy to derive that

$$\begin{aligned} \rho _{\tau }(Y_{i}-\varvec{X}_{i}^\top \varvec{\beta })-\rho _{\tau }(Y_{i}-\varvec{X}_{i}^\top \varvec{\xi })&=\rho _{\tau }(\epsilon _{i}-a_{i})-\rho _{\tau }(\epsilon _{i})\\&=(\epsilon _{i}-a_{i})\{\tau -I(\epsilon _{i}\le a_{i})\}\\&-\epsilon _{i}(\tau -I(\epsilon _{i}\le 0))\\&=-a_{i}\tau +a_{i}I(\epsilon _{i}\le a_{i})\\&-\left[ \epsilon _{i}I(\epsilon _{i}\le a_{i})-\epsilon _{i}I(\epsilon _{i}\le 0)\right] . \end{aligned}$$

By the definition of \(\tau = E\left( I(\epsilon _{i} \le 0)\right) =F_{i}(0)\), the mean value theorem, and the integration-by-parts identity \(\int _{0}^{a_{i}}sf_{i}(s)\textrm{d}s=a_{i}F_{i}(a_{i})-\int _{0}^{a_{i}}F_{i}(s)\textrm{d}s\), combined with assumption A3, we have, if \(a_i>0\),

$$\begin{aligned} E\left[ \rho _{\tau }(Y_{i}-\varvec{X}_{i}^\top \varvec{\beta })-\rho _{\tau }(Y_{i}-\varvec{X}_{i}^\top \varvec{\xi })\right]&=-a_{i}F_{i}(0)+a_{i}F_{i}(a_{i})\\&-\int _{0}^{a_{i}}sf_{i}(s)\textrm{d}s\\&=\int _{0}^{a_{i}}\left[ F_{i}(s)-F_{i}(0)\right] \textrm{d}s\\&=\dfrac{1}{2}f_{i}(0)a_{i}^2+o(1)a_{i}^2, \end{aligned}$$

where o(1) is uniformly over all \(i=1,\ldots ,n\). The same result can be obtained when \(a_i <0\). Then

$$\begin{aligned} nE\left[ Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi })\right]&=\sum _{i=1}^n \left[ \dfrac{1}{2}f_{i}(0)a_{i}^2+o(1)a_{i}^2\right] \\&\ge \dfrac{c}{2}\sum _{i=1}^na_{i}^2\\&=\dfrac{c}{2} (\varvec{\beta }-\varvec{\xi })^{\top } \mathbb {X}^{\top }\mathbb {X}(\varvec{\beta }-\varvec{\xi }). \end{aligned}$$

As \(\varvec{\beta }-\varvec{\xi } \in \mathcal {C}(\mathbb {X}^\top )\), we may write \(\varvec{\beta }-\varvec{\xi } = \mathbb {X}^{\top }\varvec{\zeta }\) for some vector \(\varvec{\zeta }\). Let the spectral decomposition of \(\mathbb {X}\mathbb {X}^{\top }\) be \(\varvec{U}\varvec{D}\varvec{U}^{\top }\), with the diagonal entries of \(\varvec{D}\) arranged in decreasing order and \(\varvec{U}\) orthogonal. Thus \((\varvec{\beta }-\varvec{\xi })^{\top } \mathbb {X}^{\top }\mathbb {X}(\varvec{\beta }-\varvec{\xi })= \varvec{\zeta }^{\top }\mathbb {X}\mathbb {X}^{\top }\mathbb {X}\mathbb {X}^{\top }\varvec{\zeta }= \varvec{\zeta }^{\top }\varvec{U}\varvec{D}^2\varvec{U}^{\top }\varvec{\zeta } \ge \lambda _{\min }(\mathbb {X}\mathbb {X}^\top )\varvec{\zeta }^{\top }\varvec{U}\varvec{D}\varvec{U}^{\top }\varvec{\zeta }= \lambda _{\min }(\mathbb {X}\mathbb {X}^\top )\Vert \varvec{\beta }-\varvec{\xi }\Vert _{2}^2\). Combined with the definition of \(\mathcal {B}_{\alpha }\) and Lemma 1(e), it establishes that

$$\begin{aligned} I_{n,1}\ge \dfrac{c}{2}\lambda _{\min }(\mathbb {X}\mathbb {X}^\top )\Vert \varvec{\beta }-\varvec{\xi }\Vert _{2}^2 \ge \dfrac{c\alpha ^2 c_{5}}{2c_{1}c_{4}} p_n n^{-\omega -1}M_{2}^2 \end{aligned}$$
(19)

with probability going to 1, where c is a lower bound for \(f_{i}(\cdot )\) in a neighborhood of 0.

Define \(\rho (Y,s)=\rho _{\tau }(Y-s)\) and omit the subscript \(\tau \) for simplicity. Note that the following Lipschitz condition holds for \(\rho (Y_i,\cdot )\),

$$\begin{aligned}{} & {} \left| \rho (Y_{i},s_{1})-\rho (Y_{i},s_{2})\right| \nonumber \\{} & {} \le \max \left\{ \tau , 1-\tau \right\} \left| s_{1}-s_{2}\right| \le \left| s_{1}-s_{2}\right| . \end{aligned}$$
(20)

By definition, \(n^{-1}\sum _{i=1}^n\left| \varvec{X}_{i}^\top (\varvec{\beta }-\varvec{\xi })\right| ^2\le \alpha ^2M_{1}^2\) holds for any \(\varvec{\beta }\in \mathcal {B}_{\alpha }\). Using (18) and (20), we have that

$$\begin{aligned} E\left| I_{n,2}\right| ^2&=E \left| \dfrac{1}{n}\sum _{i=1}^n \left[ \rho (Y_i,\varvec{X}_{i}^\top \varvec{\beta })- \rho (Y_i,\varvec{X}_{i}^\top \varvec{\xi })\right] \right. \\&\quad \left. -E\left[ \rho (Y_i,\varvec{X}_{i}^\top \varvec{\beta })- \rho (Y_i,\varvec{X}_{i}^\top \varvec{\xi })\right] \right| ^2\\&= \dfrac{1}{n^2} \sum _{i=1}^n E\left| \rho (Y_i,\varvec{X}_{i}^\top \varvec{\beta })- \rho (Y_i,\varvec{X}_{i}^\top \varvec{\xi })\right. \\&\left. -E\left[ \rho (Y_i,\varvec{X}_{i}^\top \varvec{\beta })- \rho (Y_i,\varvec{X}_{i}^\top \varvec{\xi })\right] \right| ^2\\&\le \alpha ^2M_{1}^2n^{-1}, \end{aligned}$$

which entails that

$$\begin{aligned} \left| I_{n,2}\right| =O_{P}(\alpha M_{1}n^{-1/2}). \end{aligned}$$
(21)

By assumption A2, we have

$$\begin{aligned} \Vert \varvec{\beta }^*\Vert _{2}^2\le \frac{ \{\varvec{\beta }^*\}^\top \varvec{\Sigma }\varvec{\beta }^*}{\lambda _{\min }(\varvec{\Sigma })} \le Cc_{4}c_{5}^{-1}n^{\omega }. \end{aligned}$$

Moreover, using Cauchy-Schwarz inequality, the term \(|I_{n,3}|\) is bounded by

$$\begin{aligned}&\left| I_{n,3}\right| \le 2\lambda \Vert \varvec{\xi }\Vert _{2}\sup _{\mathcal {B}_{\alpha }}\Vert {\varvec{\beta }}-\varvec{\xi }\Vert _{2}\nonumber \\&\le 2\lambda \alpha M_{2} \Vert \varvec{\beta }^*\Vert _{2} \le 2Cc_{4}c_{5}^{-1}\lambda \alpha M_{2}n^{\omega }, \end{aligned}$$
(22)

with probability approaching one.

Combining (19), (21), and (22), we have \(P\left( \inf _{\mathcal {B}_{\alpha }}\left\{ \mathcal {L}(\varvec{\beta })-\mathcal {L}(\varvec{\xi })\right\} >0\right) \rightarrow 1\) as \(n\rightarrow \infty \): with \(M_{1}= n^{-2\omega -\kappa }/\sqrt{\log (n)}\), \(M_{2}=n^{1/2-3\omega /2-\kappa }/\sqrt{p_n\log (n)}\), and \(n^{2\omega +\kappa }\sqrt{\log (n)}=o(n^{1/2})\) by the assumption on \(p_n\), the terms \(I_{n,2}\) and \(I_{n,3}\) are dominated by \(I_{n,1}\) for sufficiently large \(\alpha \). By the convexity of \(\mathcal {L}(\varvec{\beta })-\mathcal {L}(\varvec{\xi })\) and the fact that \(\mathcal {L}(\hat{\varvec{\beta }})\le \mathcal {L}(\varvec{\xi })\), we have

$$\begin{aligned}{} & {} P\left( \Vert \mathbb {X}(\hat{\varvec{\beta }}-\varvec{\xi })\Vert _{\infty }=O\left( n^{-2\omega -\kappa }/\sqrt{\log (n)}\right) , \right. \\{} & {} \quad \left. \Vert \hat{\varvec{\beta }}-\varvec{\xi }\Vert _{2}=O\left( n^{1/2-3\omega /2-\kappa }/\sqrt{p_n\log (n)}\right) \right) \rightarrow 1. \end{aligned}$$

Part (iii). By direct calculation, we have that

$$\begin{aligned}&\mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top }+\lambda \mathbb {I}_n)^{-1} \mathbb {X}\varvec{\beta }^*-\varvec{\xi }\\&\quad = \mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top })^{-1/2}(\mathbb {I}_n+\lambda (\mathbb {X}\mathbb {X}^{\top })^{-1})^{-1} (\mathbb {X}\mathbb {X}^{\top })^{-1/2}\mathbb {X}\varvec{\beta }^*-\varvec{\xi }\\&\quad =\sum _{k=1}^{\infty } \mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top })^{-1/2}\{-\lambda (\mathbb {X}\mathbb {X}^{\top })^{-1}\}^k (\mathbb {X}\mathbb {X}^{\top })^{-1/2}\mathbb {X}\varvec{\beta }^*, \end{aligned}$$

which, combined with Hölder's inequality, yields that

$$\begin{aligned}&\Vert \mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top }+\lambda \mathbb {I}_n)^{-1} \mathbb {X}\varvec{\beta }^*-\varvec{\xi }\Vert _2\\&\quad \le \sum _{k=1}^{\infty } \Vert \mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top })^{-1/2}\{-\lambda (\mathbb {X}\mathbb {X}^{\top })^{-1}\}^k (\mathbb {X}\mathbb {X}^{\top })^{-1/2}\mathbb {X}\varvec{\beta }^*\Vert _2\\&\quad \le \sum _{k=1}^{\infty } \Vert \lambda (\mathbb {X}\mathbb {X}^{\top })^{-1}\Vert ^k \cdot \Vert \varvec{\beta }^*\Vert _2. \end{aligned}$$

This and conditions A1–A2 guarantee that

$$\begin{aligned}&\Vert \mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top }+\lambda \mathbb {I}_n)^{-1} \mathbb {X}\varvec{\beta }^*-\varvec{\xi }\Vert _2\\&\quad \le \frac{C}{\lambda _{\min }(\varvec{\Sigma })} \sum _{k=1}^{\infty }\{\lambda /\lambda _{\min }(\mathbb {X}\mathbb {X}^{\top })\}^{k} \\&\quad =O(n^{\omega })\frac{\lambda /\lambda _{\min }(\mathbb {X}\mathbb {X}^{\top })}{1-\lambda /\lambda _{\min }(\mathbb {X}\mathbb {X}^{\top })}\\&\quad =O_{P}(\lambda n^{2\omega }p_n^{-1}), \end{aligned}$$

which, together with (17), establishes result (iii). \(\square \)

Appendix C. Proof of Theorem 1

Applying \(\mathbb {P}_{\mathbb {X}^\top }\left( \hat{\varvec{\beta }}-\varvec{\xi }\right) =\hat{\varvec{\beta }}-\varvec{\xi }\) and the Cauchy-Schwarz inequality, we obtain that

$$\begin{aligned}&\Vert \hat{\varvec{\beta }}-\varvec{\xi }\Vert _{\infty }\\&\quad \le \max _{1\le i\le p_n}\left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\left( \hat{\varvec{\beta }}-\varvec{\xi }\right) \right| \\&\quad \le \min \left\{ \Vert \hat{\varvec{\beta }}-\varvec{\xi }\Vert _{2}\max _{1\le i\le p_n}\Vert \mathbb {P}_{\mathbb {X}^\top }\varvec{e}_{i}\Vert _{2},\ \Vert \mathbb {X}(\hat{\varvec{\beta }}-\varvec{\xi })\Vert _{2}\right. \\&\qquad \left. \max _{1\le i\le p_n}\Vert (\mathbb {X}\mathbb {X}^\top )^{-1}\mathbb {X}\varvec{e}_{i}\Vert _{2} \right\} \\&\quad \le \min \left\{ \Vert \hat{\varvec{\beta }}-\varvec{\xi }\Vert _{2}\max _{1\le i\le p_n}\Vert \mathbb {P}_{\mathbb {X}^\top }\varvec{e}_{i}\Vert _{2},\ n^{1/2}\Vert \mathbb {X}(\hat{\varvec{\beta }}-\varvec{\xi })\Vert _{\infty }\right. \\&\left. \max _{1\le i\le p_n}\Vert (\mathbb {X}\mathbb {X}^\top )^{-1}\mathbb {X}\varvec{e}_{i}\Vert _{2} \right\} . \end{aligned}$$

Using Lemma 1(a), (b) and the Bonferroni inequality, we obtain that

$$\begin{aligned}&P\left( \max _{1\le i\le p_n}\Vert (\mathbb {X}\mathbb {X}^\top )^{-1}\mathbb {X}\varvec{e}_{i}\Vert _{2}^2>c_{1}c_{2}'n^{1+2\omega }p_n^{-2}\right) \\&\quad \le \sum _{i=1}^{p_n}P\left( \Vert (\mathbb {X}\mathbb {X}^\top )^{-1}\mathbb {X}\varvec{e}_{i}\Vert _{2}^2>c_{1}c_{2}'n^{1+2\omega }p_n^{-2}\right) \\&\quad \le 3\exp \left( \log (p_n)-C_{1}n\right) , \end{aligned}$$

and

$$\begin{aligned}&P\left( \max _{1\le i\le p_n}\Vert \mathbb {P}_{\mathbb {X}^\top }\varvec{e}_{i}\Vert _{2}^2>c_{2}'n^{1+\omega }p_n^{-1}\right) \\&\quad \le \sum _{i=1}^{p_n}P\left( \Vert \mathbb {P}_{\mathbb {X}^\top }\varvec{e}_{i}\Vert _{2}^2>c_{2}'n^{1+\omega }p_n^{-1}\right) \\&\quad \le 4\exp \left( \log (p_n)-C_{1}n\right) . \end{aligned}$$

This, together with the assumption on \(p_n\), i.e., \(\log (p_n) = o({n^{1-5\omega -2\kappa -v}}/{\log (n)})\), and result (ii), yields that

$$\begin{aligned} \Vert \hat{\varvec{\beta }}-\varvec{\xi }\Vert _{\infty }=O_{P}\left( \frac{n^{1-\omega -\kappa }}{p_n\sqrt{\log (n)}}\right) . \end{aligned}$$

Then by Lemma 1(c) and (d),

$$\begin{aligned}&\min _{j\in \mathcal {S}^*}|\hat{\beta }_{j}|-\max _{j\notin \mathcal {S}^*}|\hat{\beta }_{j}|\\&\quad =\min _{j\in \mathcal {S}^*}\left| \hat{\beta }_{j}-\xi _{j}+\xi _{j}\right| -\max _{j\notin \mathcal {S}^*}\left| \hat{\beta }_{j}-\xi _{j}+\xi _{j}\right| \\&\quad \ge \min _{j\in \mathcal {S}^*}|\xi _{j}|-\max _{j\notin \mathcal {S}^*}|\xi _{j}|-2\Vert \hat{\varvec{\beta }}-\varvec{\xi }\Vert _{\infty }\\&\quad \ge \frac{cn^{1-\omega -\kappa }}{p_{n}}\left( 1+o_{P}(1)\right) -O_{P}\left( \frac{n^{1-\omega -\kappa }}{p_n\sqrt{\log (n)}}\right) \\&\quad \ge \frac{cn^{1-\omega -\kappa }}{p_{n}}\left( \dfrac{1}{2}+o_{P}(1)\right) . \end{aligned}$$

This completes the proof. \(\square \)

Appendix D. Proof of Theorem 2

Recall the QRR screening index set \(\mathcal {F}_d=\{i_1,i_2,\ldots ,i_d\}\) and denote \(\mathcal {S}_k=\{i_1,\ldots ,i_k\}\) for \(k=1,\ldots ,d\). By Theorem 1, we have \(P(\mathcal {S}_{s_n} = \mathcal {S}^*)=1\), where \(s_n\) is defined in Sect. 3. Given \(\mathcal {S}_{k-1}\), we now consider the likelihood-based statistic

$$\begin{aligned} L(\mathcal {S}_{k})=\sum _{i=1}^n\left\{ \rho _{\tau }(Y_{i}-\varvec{X}_{i,\mathcal {S}_{k-1}}^\top \hat{\varvec{\beta }}_{\mathcal {S}_{k-1}})- \rho _{\tau }(Y_{i}-\varvec{X}_{i,\mathcal {S}_{k}}^\top \hat{\varvec{\beta }}_{\mathcal {S}_{k}})\right\} , \end{aligned}$$
(23)

where \(k=2,\ldots ,d\).
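As a concrete illustration, the statistic (23) is simply the drop in cumulative check loss between two nested quantile fits. The following is a minimal numpy sketch (the function names are hypothetical, and the coefficient vectors would come from the quantile regression fits of the respective submodels):

```python
import numpy as np

def rho_tau(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def L_stat(y, X, S_prev, S_curr, beta_prev, beta_curr, tau):
    """Statistic (23): decrease in total check loss when moving from
    the fit on S_{k-1} (S_prev) to the fit on S_k (S_curr)."""
    r_prev = y - X[:, S_prev] @ beta_prev
    r_curr = y - X[:, S_curr] @ beta_curr
    return float(np.sum(rho_tau(r_prev, tau) - rho_tau(r_curr, tau)))
```

Large values of \(L(\mathcal {S}_k)\) indicate that the newly added index \(i_k\) materially reduces the quantile loss.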

For \(|\mathcal {S}_{k-1}|<s_n\), we have \(\mathcal {S}^* \not \subset \mathcal {S}_{k-1}\). We next prove that the k-th shortlisted index \(i_k\) is selected with probability approaching one, since the statistic satisfies \(L(\mathcal {S}_k) > C k\log (n)\log (p_n)\). Recall that \(Q_{n}(\varvec{\beta })=n^{-1}\sum _{i=1}^n\rho _\tau (Y_i-\varvec{X}_i^\top \varvec{\beta })\), and let \(\tilde{\varvec{\beta }}_{\mathcal {S}}\) denote the pseudo true coefficient on the support of the model \(\mathcal {S}\). For any \(\mathcal {M}_1=\mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {S}_{k-1}^{c}\cap \mathcal {S}^*\), we decompose

$$\begin{aligned} \begin{aligned}&Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_1}) \\&\quad = \left\{ Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_1})\right. \\&\qquad -\left. E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right\} \\&\qquad +\left\{ E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}})\right] - E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right\} \\&\quad =I_1 + I_2. \end{aligned} \end{aligned}$$

By the triangle inequality, we note that

$$\begin{aligned} \begin{aligned} |I_1|&=\left| Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_1}) - E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right| \\&=\left| Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_1}) - \left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right. \\&\left. \quad + \left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] - E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right| \\&\le \left| Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_1}) - \left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right| \\&\quad \left. + \left| \left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] - E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right| \right. \\&\le \left| Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}})\right| + \left| Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_1}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}})\right| \\&\quad + \left| \left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] - E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right| \\&= |I_{11}| + |I_{12}| + |I_{13}|. \end{aligned} \end{aligned}$$

For \(I_{11}\), by combining Lemma 2, the Lipschitz condition in (20), and assumption A4, we obtain

$$\begin{aligned} \begin{aligned} |I_{11}|&\le \dfrac{1}{n}\sum _{i=1}^{n} \left| \varvec{X}^{\top }_{i,\mathcal {S}_{k-1}}\left( {\tilde{\varvec{\beta }}}_{\mathcal {S}_{k-1}}-\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}\right) \right| \\&\le \dfrac{1}{n}\sum _{i=1}^{n} \Vert \varvec{X}_{i,\mathcal {S}_{k-1}}\Vert _{2} \Vert {\tilde{\varvec{\beta }}}_{\mathcal {S}_{k-1}}-\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}\Vert _{2}\\&\le C_{11} \sqrt{|\mathcal {S}_{k-1}|} \Vert {\tilde{\varvec{\beta }}}_{\mathcal {S}_{k-1}}-\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}\Vert _{2}\\&= O_p\left( |\mathcal {S}_{k-1}|\sqrt{\dfrac{\log (n)\log (p_n)}{n}}\right) . \end{aligned} \end{aligned}$$
(24)

Similarly, we have

$$\begin{aligned} |I_{12}| = O_p\left( |\mathcal {M}_{1}|\sqrt{\dfrac{\log (n)\log (p_n)}{n}}\right) . \end{aligned}$$
(25)

For \(I_{13}\), by the Lipschitz condition in (20) and assumption A4,

$$\begin{aligned} \left| Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right| \le C_{13}\sqrt{|\mathcal {M}_1|} \Vert {\tilde{\varvec{\beta }}}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_{1}}\Vert _{2}. \end{aligned}$$

Note that \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}\) lies in \(\mathbb {R}^{|\mathcal {S}_{k-1}|}\) while \(\tilde{\varvec{\beta }}_{\mathcal {M}_1}\) belongs to \(\mathbb {R}^{|\mathcal {M}_{1}|}\). When we write \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1}\), since \(\mathcal {S}_{k-1}\subset \mathcal {M}_{1}\), we implicitly append the coefficient \(\tilde{\beta }_j = 0\) to \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}\) so that the two vectors are aligned.

Then by Hoeffding’s inequality, we have for any \(t>0\),

$$\begin{aligned}{} & {} P\left( \left| \left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] - E\left[ Q_n(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\tilde{\varvec{\beta }}_{\mathcal {M}_1})\right] \right| >t\right) \\{} & {} \quad \le 2 \exp \left( -\dfrac{2nt^2}{C_{13}\sqrt{|\mathcal {M}_1|} \Vert {\tilde{\varvec{\beta }}}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_{1}}\Vert _{2}}\right) . \end{aligned}$$

Taking \(t=|\mathcal {M}_{1}|\sqrt{{\log (n)\log (p_n)}/{n}}\), we have

$$\begin{aligned}{} & {} P\left( \left| I_{13}\right| >|\mathcal {M}_{1}|\sqrt{\dfrac{\log (n)\log (p_n)}{n}}\right) \\{} & {} \quad \le 2 \exp \left( -\dfrac{2|\mathcal {M}_{1}|^{3/2}\log (n)\log (p_n)}{C_{13} \Vert {\tilde{\varvec{\beta }}}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_{1}}\Vert _{2}}\right) \rightarrow 0. \end{aligned}$$

Moreover, using Boole's inequality, for any \(\mathcal {M}_1=\mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {S}_{k-1}^{c}\cap \mathcal {S}^*\), we have

$$\begin{aligned} \begin{aligned}&P\left( \sup _{2\le |\mathcal {M}_1|\le d} \left| I_{13}\right| >|\mathcal {M}_{1}|\sqrt{\dfrac{\log (n)\log (p_n)}{n}}\right) \\&\quad \le C_{13}'\sum _{2\le |\mathcal {M}_1|\le d} \left( {\begin{array}{c}p_n\\ |\mathcal {M}_1|\end{array}}\right) \exp \left( -{|\mathcal {M}_1|^{3/2}\log (n)\log (p_n)}\right) \\&\quad \le C_{13}'\sum _{k=1}^{d}\left( \frac{p_n e}{k}\right) ^{k} \exp \left( -{k^{3/2}\log (n)\log (p_n)}\right) \rightarrow 0.\\ \end{aligned} \end{aligned}$$
(26)

Denote \(\gamma _n = \sqrt{k\log (n)\log (p_n)/n}\). Combining (24), (25), and (26) then yields \(|I_1|=o_p(\gamma _n^2)\) uniformly over all \(\mathcal {M}_1\).

For \(I_2\), we use the difference between \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}\) and \(\tilde{\varvec{\beta }}_{\mathcal {M}_1}\) to derive its lower bound, which reduces notational redundancy. Employing Knight's identity with \(u_i=Y_i - \varvec{X}^{\top }_{i,\mathcal {M}_1}\tilde{\varvec{\beta }}_{\mathcal {M}_1}\) and \(v_i=\varvec{X}^{\top }_{i,\mathcal {M}_1}(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1})\), we have

$$\begin{aligned} \begin{aligned} I_{2}&=(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1})^{\top }\\&E\left[ -\varvec{X}_{i,\mathcal {M}_1}\left( \tau -I(Y_i - \varvec{X}^{\top }_{i,\mathcal {M}_1}\tilde{\varvec{\beta }}_{\mathcal {M}_1}<0)\right) \right] \\&\qquad +E\left\{ \int _{0}^{\varvec{X}^{\top }_{i,\mathcal {M}_1}(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1})}\right. \\&\left. \left[ I(Y_i - \varvec{X}^{\top }_{i,\mathcal {M}_1}\tilde{\varvec{\beta }}_{\mathcal {M}_1}\le s)-I(Y_i - \varvec{X}^{\top }_{i,\mathcal {M}_1}\tilde{\varvec{\beta }}_{\mathcal {M}_1}\le 0)\right] \textrm{d}s\right\} \\&= E\left\{ \int _{0}^{\varvec{X}^{\top }_{i,\mathcal {M}_1}(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1})}\right. \\&\left. \left[ I(Y_i - \varvec{X}^{\top }_{i,\mathcal {M}_1}\tilde{\varvec{\beta }}_{\mathcal {M}_1}\le s)-I(Y_i - \varvec{X}^{\top }_{i,\mathcal {M}_1}\tilde{\varvec{\beta }}_{\mathcal {M}_1}\le 0)\right] \textrm{d}s\right\} \\&\ge 0.5\underline{f}\Vert \tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1} \Vert _{2}^2\\&\ge 0.5\underline{f} \tilde{b}_j^2, \end{aligned} \end{aligned}$$

where \(\tilde{b}_j\) is the pseudo true coefficient of variable j in \(\tilde{\varvec{\beta }}_{\mathcal {M}_1}\). Then we consider a coefficient vector \(\breve{{\varvec{\beta }}}_{\mathcal {M}_1}\) in which the coefficient of variable j is 0 and the remaining coefficients coincide with those of \(\tilde{{\varvec{\beta }}}_{\mathcal {M}_1}\). Now by condition A4,

$$\begin{aligned} \begin{aligned}&\left| E\left[ X_{j}\psi _{\tau }(Y-\varvec{X}^{\top }_{\mathcal {M}_1}\tilde{{\varvec{\beta }}}_{\mathcal {M}_1}) - X_{j}\psi _{\tau }(Y-\varvec{X}^{\top }_{\mathcal {M}_1}\breve{{\varvec{\beta }}}_{\mathcal {M}_1})\right] \right| \\&\quad = \left| E\left[ X_{j}\left( I(Y\le \varvec{X}^{\top }_{\mathcal {M}_1}\breve{{\varvec{\beta }}}_{\mathcal {M}_1})-I(Y\le \varvec{X}^{\top }_{\mathcal {M}_1}\tilde{{\varvec{\beta }}}_{\mathcal {M}_1})\right) \right] \right| \\&\quad \le \underline{f} |\tilde{b}_j|. \end{aligned} \end{aligned}$$

Thus we have \(I_2\ge \gamma _{l}^2/(2\underline{f})\) by \(|E[X_{j}\psi _{\tau }(Y-\varvec{X}^{\top }_{\mathcal {M}_1}\tilde{{\varvec{\beta }}}_{\mathcal {M}_1})]|>\gamma _{l}\) in assumption A6, where \(\gamma _n\le \gamma _{l}\). Therefore, for any \(\mathcal {M}_1=\mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {S}_{k-1}^{c}\cap \mathcal {S}^*\) and some constant \(C_{M_1}\),

$$\begin{aligned}&{} P\left( Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k}})> C_{M_1}k\log (n)\log (p_n)\right) \nonumber \\{}&\quad \ge P\left( \min _{\mathcal {M}_1}Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_1})> C_{M_1}n\gamma _n^2\right) \rightarrow 1.\nonumber \\ \end{aligned}$$
(27)

For \(|\mathcal {S}_{k-1}|\ge s_n\), Theorem 1 implies \(P(\mathcal {S}_{k-1}\supset \mathcal {S}^*)\rightarrow 1\). We now prove that the k-th shortlisted index \(i_k\) is discarded, since the statistic satisfies \(L(\mathcal {S}_k) < C k\log (n)\log (p_n)\) with probability approaching one. Consider any \(\mathcal {M}_2\) satisfying \(|\mathcal {M}_2|=k\) and \(\mathcal {S}^*\subset \mathcal {M}_2\); then \(Q_n({\varvec{\beta }}_{\mathcal {S}_{k-1}}^*) = Q_n({\varvec{\beta }}_{\mathcal {M}_{2}}^*)\). Thus we can decompose

$$\begin{aligned} \begin{aligned}&\left| Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_2})\right| \\&\quad = \left| \left[ Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n({\varvec{\beta }}_{\mathcal {S}_{k-1}}^*)\right] \right. \\&\left. - \left[ Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_2}) - Q_n({\varvec{\beta }}_{\mathcal {M}_{2}}^*)\right] \right| \\&\quad \le \left| Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n({\varvec{\beta }}_{\mathcal {S}_{k-1}}^*)\right| \\&+ \left| Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_2}) - Q_n({\varvec{\beta }}_{\mathcal {M}_{2}}^*)\right| \\&\quad =|I_3| + |I_4|. \end{aligned} \end{aligned}$$

The inequality holds by the triangle inequality. Similar to the argument of (24), by Lemma 3 we have

$$\begin{aligned}{} & {} |I_3|\le \dfrac{1}{n}\sum _{i=1}^{n} \left| \varvec{X}^{\top }_{i,\mathcal {S}_{k-1}}\left( {{\varvec{\beta }}}_{\mathcal {S}_{k-1}}^*-\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}\right) \right| \\{} & {} \quad =O_p\left( |\mathcal {S}_{k-1}|\sqrt{\dfrac{\log (n)\log (p_n)}{n}}\right) . \end{aligned}$$

Similarly, we have

$$\begin{aligned} |I_4|=O_p\left( |\mathcal {M}_{2}|\sqrt{\dfrac{\log (n)\log (p_n)}{n}}\right) . \end{aligned}$$

Therefore, for any \(\mathcal {M}_2 = \mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {F}_d{\setminus } \mathcal {S}^*\) and some constant \(C_{M_2}\),

$$\begin{aligned}{} & {} P\left( Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k}}) \le C_{M_2}k\log (n)\log (p_n)\right) \nonumber \\{} & {} \quad \ge P\left( \max _{\mathcal {M}_2}Q_n(\hat{\varvec{\beta }}_{\mathcal {S}_{k-1}}) - Q_n(\hat{\varvec{\beta }}_{\mathcal {M}_2})\le C_{M_2}n\gamma _n^2\right) \rightarrow 1.\nonumber \\ \end{aligned}$$
(28)

One can take a common threshold constant C with \(C_{M_2}<C<C_{M_1}\). Combining (27) and (28) leads to \(P(\hat{\mathcal {S}}_{V}=\mathcal {S}^*)=1\). In practice, we suggest choosing a conservative \(C\in (0,1)\), since \(C_{M_2}\) tends to zero. This completes the proof. \(\square \)
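The threshold rule underlying this proof, namely keep the k-th shortlisted index while \(L(\mathcal {S}_k)>Ck\log (n)\log (p_n)\) with a conservative \(C\in (0,1)\), can be sketched as follows. This is an illustrative simplification under stated assumptions (the function name and the stop-at-first-rejection behavior are assumptions, not the paper's exact Algorithm 2):

```python
import numpy as np

def post_screen_select(L_vals, n, p_n, C=0.5):
    """Scan the shortlisted indices in screening order; L_vals[k-2]
    holds L(S_k) for k = 2, ..., d.  Keep index i_k while the statistic
    clears the threshold C * k * log(n) * log(p_n); stop at the first
    rejection (a simplifying assumption).  C = 0.5 is an assumed
    default inside the suggested range (0, 1)."""
    keep = [0]  # the top-ranked index i_1 is always retained
    for k, L in enumerate(L_vals, start=2):
        if L > C * k * np.log(n) * np.log(p_n):
            keep.append(k - 1)
        else:
            break
    return keep
```

With \(n=100\) and \(p_n=1000\), the threshold at \(k=2\) is roughly 31.8, so only indices whose loss reduction is substantial survive the scan.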

Appendix E. Proof of Theorem 3

The proposed sequential LIPS builds on the framework of forward update and backward deletion, with an internal competition step connecting the two phases. Accordingly, the proof is divided into two parts: (i) showing that \(P(\hat{\mathcal {S}}_{V}=\mathcal {S}^*)=1\); (ii) showing that \(P(\hat{\mathcal {S}}_{S}=\hat{\mathcal {S}}_{V})=1\).

Table 9 Sure screening rate \(P_{\mathcal {S}}\) (%) of five methods under model size \(d_n=n-1\) for three scenarios in Section 4.2. (\(n=200,p_n=5000\))

Part (i). This follows from the results in Theorem 2.

Part (ii). Denote the currently selected index set by \(\hat{\mathcal {S}}_{T}\). In the initial step, we set \(\hat{\mathcal {S}}_{T} = \hat{\mathcal {S}}_{V}\).

If \(\mathcal {S}^*\not \subset \hat{\mathcal {S}}_{T}\), without loss of generality, we let \(k_1\) be an index satisfying \(k_1 \in \hat{\mathcal {S}}_{T}^{c} \cap \mathcal {S}^*\). For \(\mathcal {M}_1=\hat{\mathcal {S}}_{T} \cup \{k_1\}\), we consider

$$\begin{aligned} L(\mathcal {M}_{1})=\sum _{i=1}^n\left\{ \rho _{\tau }(Y_{i}-\varvec{X}_{i,\hat{\mathcal {S}}_{T}}^\top \hat{\varvec{\beta }}_{\hat{\mathcal {S}}_{T}})- \rho _{\tau }(Y_{i}-\varvec{X}_{i,\mathcal {M}_{1}}^\top \hat{\varvec{\beta }}_{\mathcal {M}_{1}})\right\} . \end{aligned}$$

By (27) of Theorem 2, we have

$$\begin{aligned} P\left( L(\mathcal {M}_{1})\ge C (|\hat{\mathcal {S}}_{T}|+1)\log (n)\log (p_n)\right) \rightarrow 1. \end{aligned}$$

That is, omitted active features can be reselected.

If \(\mathcal {S}^*\subset \hat{\mathcal {S}}_{T}\) and \(\hat{\mathcal {S}}_{T} \cap \mathcal {S}^{*c} \ne \varnothing \), consider any \(k_2\in \hat{\mathcal {S}}_{T} \cap \mathcal {S}^{*c}\). Let \(\mathcal {M}_2=\hat{\mathcal {S}}_{T} {\setminus } \{k_2\}\); then \(\mathcal {S}^{*} \subset \mathcal {M}_2\). For

$$\begin{aligned} L(\mathcal {M}_{2})=\sum _{i=1}^n\left\{ \rho _{\tau }(Y_{i}-\varvec{X}_{i,{\mathcal {M}}_{2}}^\top \hat{\varvec{\beta }}_{{\mathcal {M}}_{2}})- \rho _{\tau }(Y_{i}-\varvec{X}_{i,\hat{\mathcal {S}}_{T}}^\top \hat{\varvec{\beta }}_{\hat{\mathcal {S}}_{T}})\right\} , \end{aligned}$$

by (28) of Theorem 2, we have that \(k_2\) will not be retained due to

$$\begin{aligned} P\left( L(\mathcal {M}_{2})\ge C |\hat{\mathcal {S}}_{T}|\log (n)\log (p_n)\right) \rightarrow 0. \end{aligned}$$

That is, inactive features can be removed.

An error situation arises when some spurious variables, highly correlated with the error term, take precedence over the truly active ones and prevent the latter from being selected. When this occurs, \(k_1\) might not be reselected even though \(L(\mathcal {M}_{1})\ge C (|\hat{\mathcal {S}}_{T}|+1)\log (n)\log (p_n)\). The internal competition, however, lets \(k_1\) join \(\hat{\mathcal {S}}_{T}\) temporarily, after which the spurious and redundant variables are cleared. Therefore, we have \(P(\hat{\mathcal {S}}_{S}=\hat{\mathcal {S}}_{V})=1\).

Combining Parts (i) and (ii) leads to \(P(\hat{\mathcal {S}}_{S}=\mathcal {S}^*)=1\). This completes the proof of Theorem 3. \(\square \)

Appendix F. Additional simulations

1.1 Example 1: Sure screening rate for \(n=200\), \(p_n=5000\) when model size \(d_n=n-1\)

See Table 9.

1.2 Example 2: Selection performance when \(n=300\) at \(\tau =0.5\)

See Tables 10, 11, 12.

Table 10 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDAR-1 mode under quantile level \(\tau =0.5\) when \(n=300\) (values in the parentheses represent the corresponding standard deviations)
Table 11 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDCS-1 mode under quantile level \(\tau =0.5\) when \(n=300\) (values in the parentheses represent the corresponding standard deviations)
Table 12 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for CB-1 mode under quantile level \(\tau =0.5\) when \(n=300\) (values in the parentheses represent the corresponding standard deviations)
Table 13 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDAR-1 mode under quantile level \(\tau =0.8\) when \(n=100\) (values in the parentheses represent the corresponding standard deviations)

1.3 Example 3: Selection performance when \(n=100\) at \(\tau =0.8\)

See Tables 13, 14 and 15.

Table 14 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDCS-1 mode under quantile level \(\tau =0.8\) when \(n=100\) (values in the parentheses represent the corresponding standard deviations)
Table 15 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for CB-1 mode under quantile level \(\tau =0.8\) when \(n=100\) (values in the parentheses represent the corresponding standard deviations)
Table 16 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDAR-1 mode under quantile level \(\tau =0.8\) when \(n=300\) (values in the parentheses represent the corresponding standard deviations)
Table 17 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDCS-1 mode under quantile level \(\tau =0.8\) when \(n=300\) (values in the parentheses represent the corresponding standard deviations)

1.4 Example 4: Selection performance when \(n=300\) at \(\tau =0.8\)

See Tables 16, 17 and 18.

Table 18 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for CB-1 mode under quantile level \(\tau =0.8\) when \(n=300\) (values in the parentheses represent the corresponding standard deviations)
Table 19 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDAR-3 mode under quantile level \(\tau =0.5\) (values in the parentheses represent the corresponding standard deviations)

1.5 Example 5: Weak correlations

In this example, we revisit the two scenarios from Section 4.2, but with notably weak correlation among predictors in the data generation process. We examine the quantile level \(\tau =0.5\) for each scenario. The comparative performance of the methods is evaluated through the average quantile prediction error (QPE), false negatives (FN), false positives (FP), and the average running time of the algorithm over 500 replications. The results are shown in Tables 19 and 20.

Table 20 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for BDCS-3 mode under quantile level \(\tau =0.5\) (values in the parentheses represent the corresponding standard deviations)
Table 21 Average QPE, FN, FP, and the algorithm’s running time over 500 replications for time series data (values in the parentheses represent the corresponding standard deviations)

\(\bullet \) Block Diagonal Auto-Regressive correlation (BDAR):

  1. Mode 3 (denoted by BDAR-\(3_{0.5}\)): In this mode, the covariance matrix \(\varvec{\Sigma } = (\sigma _{ij})\), where \(\sigma _{ij}=0.2^{|i-j|}\), \(1\le i,j\le p_n\). Non-zero coefficients of \(\varvec{\beta }^*\) are set to \(\beta _1^*=\sqrt{8}\), \( \beta _3^*=\sqrt{2}\), \(\beta _6^*=\sqrt{3}\), and \(\beta _{10}^*=\sqrt{5}\).

\(\bullet \) Block Diagonal Compound Symmetry (BDCS):

  1. Mode 3 (denoted by BDCS-\(3_{0.5}\)): In this mode, the covariance matrix \(\varvec{\Sigma }\) has diagonal elements 1 and off-diagonal elements 0.2. Non-zero coefficients of \(\varvec{\beta }^*\) are set to \(\beta _{1}^*=\beta _{2}^*=\beta _{3}^*=\sqrt{6}\).
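For reproducibility, the two weak-correlation designs above can be generated as in the following sketch (dimensions are reduced for illustration, and the heavy-tailed \(t_3\) noise is an assumption here, since the error law of model (7) is specified in Section 4.2 rather than restated in this appendix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 200  # reduced dimensions for illustration

# BDAR-3: AR-type covariance with sigma_ij = 0.2^{|i-j|}
idx = np.arange(p)
Sigma_ar = 0.2 ** np.abs(idx[:, None] - idx[None, :])

# BDCS-3: compound symmetry with unit diagonal, off-diagonal 0.2
Sigma_cs = np.full((p, p), 0.2)
np.fill_diagonal(Sigma_cs, 1.0)

# Non-zero coefficients for BDAR-3_{0.5}: positions 1, 3, 6, 10 (1-based)
beta = np.zeros(p)
beta[[0, 2, 5, 9]] = [np.sqrt(8), np.sqrt(2), np.sqrt(3), np.sqrt(5)]

X = rng.multivariate_normal(np.zeros(p), Sigma_ar, size=n)
y = X @ beta + rng.standard_t(df=3, size=n)  # heavy-tailed noise (assumed)
```

Replacing `Sigma_ar` with `Sigma_cs` and the coefficient pattern with \(\beta _1^*=\beta _2^*=\beta _3^*=\sqrt{6}\) yields the BDCS-\(3_{0.5}\) design.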

1.6 Example 6: Time series

In this example, we use model (7) with \(n=100\), \(p_n=1000\) to generate data. The predictors are generated from the process \(\varvec{X}_i = A_1\varvec{X}_{i-1} + A_2\varvec{X}_{i-2} + \varvec{\eta }_i\) for \(i = 1,\ldots ,n\), where \(A_1\), \(A_2\), \(\varvec{\eta }_i\), and \(\epsilon _{i}\) are specified in each case. Non-zero coefficients of \(\varvec{\beta }^*\) are set to \(\beta _{1}^*=\beta _{2}^*=\beta _{3}^*=\beta _{4}^*=\sqrt{2}\). The following three cases are considered:

  • Case 1: \(A_1=(a_{ij})\), where \(a_{ij}=0.4^{|i-j|+1}\), \(1\le i,j\le p_n\), \(A_2 = 0\). For each i, \(\varvec{\eta }_i\sim N(\varvec{0}, \mathbb {I}_{p_n})\). The error term follows \(\epsilon _{i}=0.5\epsilon _{i-1}+e_i\), \(e_i\sim t(5)\).

  • Case 2: \(A_1=1.2\mathbb {I}_{p_n}, A_2 = -0.5\mathbb {I}_{p_n}\). For each i, \(\varvec{\eta }_i\sim N(\varvec{0}, \varvec{\Sigma }_{\varvec{\eta }})\), \(\varvec{\Sigma }_{\varvec{\eta }} = (\sigma _{ij})\), where \(\sigma _{ij}=0.5^{|i-j|}\), \(1\le i,j\le p_n\). The error term follows \(\epsilon _{i}=-0.5\epsilon _{i-1}+0.3\epsilon _{i-2}+e_i\), \(e_i\sim t(5)\).

  • Case 3: \(A_1=0.8\mathbb {I}_{p_n}, A_2 = 0\). For each i, \(\varvec{\eta }_i = \varvec{u}_i + B_1\varvec{u}_{i-1}+B_2\varvec{u}_{i-2}\), \(\varvec{u}_i\sim N(\varvec{0}, \mathbb {I}_{p_n})\), \(B_1=0.6\mathbb {I}_{p_n}, B_2 = -0.4\mathbb {I}_{p_n}\). The error term follows \(\epsilon _{i}=0.5\epsilon _{i-1}+e_i\), \(e_i\sim t(5)\).

The quantile level \(\tau =0.5\) is tested in each setting (Table 21).
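Case 1 above can be simulated as in the following sketch (a reduced-dimension illustration; the burn-in length and zero initial values \(\varvec{X}_0=\varvec{X}_{-1}=\varvec{0}\) are assumptions, as they are not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, burn = 100, 50, 50  # p reduced from p_n = 1000; burn-in assumed

# Case 1: A1 has entries 0.4^{|i-j|+1}; A2 = 0
idx = np.arange(p)
A1 = 0.4 ** (np.abs(idx[:, None] - idx[None, :]) + 1)

# VAR(1) predictors X_i = A1 X_{i-1} + eta_i, eta_i ~ N(0, I_p)
X = np.zeros((n + burn, p))
for i in range(1, n + burn):
    X[i] = A1 @ X[i - 1] + rng.standard_normal(p)
X = X[burn:]

# AR(1) errors eps_i = 0.5 eps_{i-1} + e_i with e_i ~ t(5)
eps = np.zeros(n)
e = rng.standard_t(df=5, size=n)
for i in range(1, n):
    eps[i] = 0.5 * eps[i - 1] + e[i]

beta = np.zeros(p)
beta[:4] = np.sqrt(2)  # beta_1* = ... = beta_4* = sqrt(2)
y = X @ beta + eps
```

Cases 2 and 3 follow the same pattern with the stated \(A_1\), \(A_2\), innovation, and error specifications swapped in; since the row sums of this \(A_1\) are below one, the Case 1 recursion is stable.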



Jiang, X., Liang, Y. & Wang, H. Screen then select: a strategy for correlated predictors in high-dimensional quantile regression. Stat Comput 34, 112 (2024). https://doi.org/10.1007/s11222-024-10424-6

