Forward Selection for Feature Screening and Structure Identification in Varying Coefficient Models

Abstract

Varying coefficient models are flexible and interpretable, and they are widely used in data analysis. Although feature screening procedures have been proposed for ultra-high dimensional varying coefficient models, none of these procedures includes structure identification. That is, existing feature screening procedures for varying coefficient models do not tell us which of the selected covariates have constant coefficients and which have non-constant coefficients. Hence, these procedures do not explicitly select partially linear varying coefficient models, which are much simpler than general varying coefficient models. Motivated by this issue, we propose a forward selection procedure for simultaneous feature screening and structure identification in varying coefficient models. Unlike existing feature screening procedures, our method classifies all covariates into three groups: covariates with constant coefficients, covariates with non-constant coefficients, and covariates with zero coefficients. Thus, our procedure can explicitly select partially linear varying coefficient models. Our procedure selects covariates sequentially until the extended BIC (EBIC) increases. We establish screening consistency for our method under some conditions. Numerical studies and real data examples support the utility of our procedure.

Availability of Data and Material

The data and material used in this paper are available from the public data repository at http://doi.org/10.18129/B9.bioc.seventyGeneData.

References

  1. Breheny, P. and Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 25, 173–187.

  2. Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.

  3. Cheng, M.Y., Feng, S., Li, G. and Lian, H. (2018). Greedy forward regression for variable screening. Austral. New Zealand J. Stat. 60, 20–42.

  4. Cheng, M.Y., Honda, T., Li, J. and Peng, H. (2014). Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Ann. Stat. 42, 1819–1849.

  5. Cheng, M.Y., Honda, T. and Zhang, J.T. (2016). Forward variable selection for sparse ultra-high dimensional varying coefficient models. J. Am. Stat. Assoc. 111, 1209–1221.

  6. Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 106, 544–557.

  7. Fan, J., Ma, Y. and Dai, W. (2014). Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Am. Stat. Assoc. 109, 1270–1284.

  8. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360.

  9. Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space. J. R. Stat. Soc. Series B 70, 849–911.

  10. Greene, W.H. (2012). Econometric Analysis, 7th edn. Pearson Education, Harlow.

  11. Honda, T., Ing, C.K. and Wu, W.Y. (2019). Adaptively weighted group Lasso for semiparametric quantile regression models. Bernoulli 25, 3311–3338.

  12. Honda, T. and Lin, C.-T. (2021). Forward variable selection for sparse ultra-high dimensional generalized varying coefficient models. Japanese J. Stat. Data Sci. 4, 151–179.

  13. Honda, T. and Yabe, R. (2017). Variable selection and structure identification for varying coefficient Cox models. J. Multivar. Anal. 161, 103–122.

  14. Horn, R.A. and Johnson, C.R. (2013). Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge.

  15. Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S., Bravo, H.C., Davis, S., Gatto, L., Girke, T., Gottardo, R., Hahne, F., Hansen, K.D., Irizarry, R.A., Lawrence, M., Love, M.I., MacDonald, J., Obenchain, V., Oleś, A.K., Pagès, H., Reyes, A., Shannon, P., Smyth, G.K., Tenenbaum, D., Waldron, L. and Morgan, M. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121.

  16. Josse, J. and Husson, F. (2016). missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 70, 1–31.

  17. Lee, E.R., Noh, H. and Park, B.U. (2014). Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 109, 216–229.

  18. Li, G., Peng, H., Zhang, J. and Zhu, L. (2012a). Robust rank correlation based screening. Ann. Stat. 40, 1846–1877.

  19. Li, R., Zhong, W. and Zhu, L. (2012b). Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107, 1129–1139.

  20. Liu, J., Li, R. and Wu, R. (2014). Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J. Am. Stat. Assoc. 109, 266–274.

  21. Liu, J.Y., Zhong, W. and Li, R.Z. (2015). A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58, 2033–2054.

  22. Marchionni, L., Afsari, B., Geman, D. and Leek, J.T. (2013). A simple and reproducible breast cancer prognostic test. BMC Genomics 14, 336.

  23. Luo, S. and Chen, Z. (2014). Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space. J. Am. Stat. Assoc. 109, 1229–1240.

  24. Mai, Q. and Zou, H. (2015). The fused Kolmogorov filter: a nonparametric model-free screening method. Ann. Stat. 43, 1471–1497.

  25. Serban, N. (2011). A space-time varying coefficient model: the equity of service accessibility. Ann. Appl. Stat. 5, 2024–2051.

  26. Song, R., Yi, F. and Zou, H. (2014). On varying-coefficient independence screening for high-dimensional varying-coefficient models. Stat. Sin. 24, 1735–1752.

  27. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Series B 58, 267–288.

  28. van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer, New York.

  29. van’t Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R. and Friend, S.H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.

  30. Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 104, 1512–1524.

  31. Wang, H. and Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. J. Am. Stat. Assoc. 104, 747–757.

  32. Wang, L., Li, H. and Huang, J.Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Am. Stat. Assoc. 103, 1556–1569.

  33. Wang, K. and Lin, L. (2016). Robust structure identification and variable selection in partial linear varying coefficient models. J. Stat. Plann. Inf. 174, 153–168.

  34. Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B 68, 49–67.

  35. Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942.

  36. Zhong, W., Duan, S. and Zhu, L. (2020). Forward additive regression for ultrahigh-dimensional nonparametric additive models. Stat. Sin. 30, 175–192.

  37. Zhu, L.P., Li, L., Li, R. and Zhu, L.X. (2011). Model-free feature screening for ultra-high dimensional data. J. Am. Stat. Assoc. 106, 1464–1475.

Acknowledgements

The author would like to thank Professor Toshio Honda for providing code to generate the orthonormal basis and for giving helpful advice. He would also like to thank the Editor and the three anonymous reviewers for their constructive comments, which significantly improved the presentation of this paper.

Funding

The author received no financial support for this research.

Author information

Corresponding author

Correspondence to Akira Shinkyu.

Ethics declarations

Conflict of Interests

The author declares no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proofs

To prove the main theorems, we need Lemma A.1.

Lemma A.1.

Suppose that Assumptions 1-3 hold with M = O(nξ) for some 0 < ξ < 1/10. Then, there exists some sequence of positive numbers {δ1n} tending to zero sufficiently slowly such that with probability tending to one, we have

$$ \frac{\tau_{\min}}{L}(1-\delta_{1n}) \leq \lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \lambda_{\max}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \frac{\tau_{\max}}{L}(1+\delta_{1n}), $$
(A.1)

for any \(S\subset \mathcal {F}\) satisfying |S|≤ M.

Proof of Lemma A.1.

Note that there exists some positive constant α such that ξ + α < 1/10. Let δ1n = nα. We will show

$$ P\left( |\lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{\min}(\boldsymbol{{\varSigma}}_{S}) |>\frac{\tau_{\min}}{L}\delta_{1n}\right)\rightarrow 0. $$
(A.2)

Note that the eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and ΣS are nonnegative because these matrices are positive semidefinite. We also note that the kth largest singular value is equal to the kth largest eigenvalue for any positive semidefinite matrix. Hence, from Corollary 7.3.5 of Horn and Johnson (2013), we get

$$ \left|\lambda_{k}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{k}(\boldsymbol{{\varSigma}}_{S}) \right|\leq \|\widehat{\boldsymbol{{\varSigma}}}_{S}-\boldsymbol{{\varSigma}}_{S} \| \ \ \text{for } k=1, \ldots, |\mathcal{F}_{c}\cap S|+|\mathcal{F}_{v}\cap S|(L-1), $$
(A.3)

where \(\{\lambda _{k}(\widehat {\boldsymbol {{\varSigma }}}_{S})\}\) and {λk(ΣS)} are the eigenvalues of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and ΣS, respectively, arranged in nonincreasing order. Thus, we obtain

$$ |\lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S})- \lambda_{\min}(\boldsymbol{{\varSigma}}_{S})|\leq \|\widehat{\boldsymbol{{\varSigma}}}_{S}-\boldsymbol{{\varSigma}}_{S}\|. $$
(A.4)
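Equations A.3 and A.4 rest on a Weyl-type perturbation bound for eigenvalues. As a quick numerical illustration (ours, not part of the original proof), the following sketch checks the bound for a random positive semidefinite matrix and a small symmetric perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d                    # positive semidefinite "population" matrix
E = 0.01 * rng.standard_normal((d, d))
Sigma_hat = Sigma + (E + E.T) / 2      # symmetric perturbation of Sigma

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]        # nonincreasing eigenvalues
lam_hat = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]

# Corollary 7.3.5 of Horn and Johnson (2013):
# |lambda_k(Sigma_hat) - lambda_k(Sigma)| <= ||Sigma_hat - Sigma|| for every k
gap = np.abs(lam_hat - lam).max()
bound = np.linalg.norm(Sigma_hat - Sigma, 2)          # spectral norm
assert gap <= bound + 1e-12
print(f"max eigenvalue gap {gap:.2e} <= spectral norm {bound:.2e}")
```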

Note that the numbers of rows and columns of \(\widehat {\boldsymbol {{\varSigma }}}_{S}\) and ΣS are equal to \(|\mathcal {F}_{c}\cap S|+|\mathcal {F}_{v}\cap S|(L-1)\). Since \(|\mathcal {F}_{v}\cap S|\leq M\) and \(|\mathcal {F}_{c} \cap S|\leq M\), we have \(\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|\leq ML\|\widehat {\boldsymbol {{\varSigma }}}_{S}-\boldsymbol {{\varSigma }}_{S}\|_{\infty }\) by the Cauchy–Schwarz inequality, and

$$ P\left( |\lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{\min}(\boldsymbol{{\varSigma}}_{S})|> \frac{\tau_{\min}}{L}\delta_{1n} \right)\leq P\left( \|\widehat{\boldsymbol{{\varSigma}}}_{S}- \boldsymbol{{\varSigma}}_{S}\|_{\infty}> \frac{\tau_{\min}}{ML^{2}}\delta_{1n} \right), $$
(A.5)

for any S satisfying |S|≤ M. Under Assumption 3, Lemmas A.1 and A.2 of Fan et al. (2014) hold. Thus, for k, j = 1,…,p, s, l = 1,…,L, and m ≥ 2, we have

$$ \begin{array}{@{}rcl@{}} &&E\left( |X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij} -E[X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij}]|^{m}\right) \\ &\leq& 2^{m}e{K}^{m}m!E[|B_{s}(Z_{i})B_{l}(Z_{i})|^{m}], \end{array} $$
(A.6)

where K is a positive constant. Recall that B(z) = A0B0(z), and \(C_{3}\leq \lambda _{\max \limits }({\boldsymbol {A}_{0}^{T}}\boldsymbol {A}_{0})\leq C_{4}\) for some positive constants C3 and C4. See Appendix A of Honda and Yabe (2017). Hence, for l = 1,…,L, we can denote \(B_{l}(z)={\sum }_{s=1}^{L}a_{ls}B_{0s}(z)\) where als is the (l, s) element of A0. Note that \({\sum }_{k=1}^{L}B_{0k}(z)=1\) due to the properties of the B-spline basis. Therefore, for s, l = 1,…,L, we obtain

$$ \begin{array}{@{}rcl@{}} |B_{s}(Z_{i})B_{l}(Z_{i})|&=&\left|\left\{\sum\limits_{j=1}^{L}a_{sj}B_{0j}(Z_{i})\right\}\left\{\sum\limits_{k=1}^{L}a_{lk}B_{0k}(Z_{i})\right\}\right| \\ &\leq& \max_{1\leq j\leq L}|a_{sj}|\times \max_{1\leq k\leq L}|a_{lk}| \leq \|\boldsymbol{A}_{0}\|_{\infty}^{2} \leq \|\boldsymbol{A}_{0}\|^{2}\leq C_{4}. \end{array} $$
(A.7)
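The partition of unity \({\sum }_{k=1}^{L}B_{0k}(z)=1\) invoked above is a standard property of the B-spline basis. A minimal scipy sketch (ours; the clamped cubic knot vector is an arbitrary choice) checks it numerically:

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                          # cubic B-splines
interior = np.linspace(0.0, 1.0, 8)            # knots on [0, 1]
t = np.r_[[0.0] * k, interior, [1.0] * k]      # clamped knot vector
nb = len(t) - k - 1                            # number of basis functions L

# Identity coefficients turn evaluation into the basis matrix B_0(z)
z = np.linspace(0.0, 1.0, 201, endpoint=False)
B0 = BSpline(t, np.eye(nb), k)(z)              # shape (len(z), nb)

# Partition of unity: every row of B_0 sums to one on [0, 1)
assert np.allclose(B0.sum(axis=1), 1.0)
print("sum_k B_0k(z) = 1 holds at all", len(z), "evaluation points")
```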

As a result, we get \(E[|B_{s}(Z_{i})B_{l}(Z_{i})|^{m}] \leq {C_{4}}^{m}\), and the right hand side of the last inequality in Eq. A.6 can be bounded by \((2KC_{4})^{m-2}m!\,8K^{2}{C_{4}^{2}}e/2\) for any m ≥ 2. Thus, from Bernstein’s inequality (see Lemma 2.2.11 of van der Vaart and Wellner (1996)), we get

$$ \begin{array}{@{}rcl@{}} &&P\left( \left|\frac{1}{n}\sum\limits_{i=1}^{n}\{X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij} -E[X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij}]\}\right|>\frac{\tau_{\min}}{ML^{2}}\delta_{1n}\right)\\ &\leq& 2\exp\left( \frac{-n\tau_{\min}\delta_{1n}}{16M^{2}L^{4}K^{2}{C_{4}^{2}}e(\tau_{\min}\delta_{1n})^{-1}+4ML^{2}KC_{4}}\right). \end{array} $$
(A.8)
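For ease of reference, the form of Bernstein’s inequality used here (Lemma 2.2.11 of van der Vaart and Wellner (1996)) states that if Y1,…,Yn are independent random variables with zero means satisfying \(E|Y_{i}|^{m}\leq m!{M_{0}^{m-2}}v_{i}/2\) for every m ≥ 2 and some constants M0 and vi (we write M0 to avoid a clash with the screening bound M), then

$$ P\left( \left|\sum\limits_{i=1}^{n}Y_{i}\right|>x\right)\leq 2\exp\left( -\frac{x^{2}}{2(v+M_{0}x)}\right), \qquad v=\sum\limits_{i=1}^{n}v_{i}, $$

for all x > 0. Taking \(Y_{i}=n^{-1}\{X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij}-E[X_{ik}B_{s}(Z_{i})B_{l}(Z_{i})X_{ij}]\}\), M0 = 2KC4/n, and \(v_{i}=8K^{2}{C_{4}^{2}}e/n^{2}\) in this inequality gives Eq. A.8.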

Since M = O(nξ), we have M ≤ νmnξ for some νm > 0. Thus, the right hand side of Eq. A.8 can be bounded by

$$ \leq 2\exp\left( \frac{-\tau_{\min}n^{1/5-2\xi-2\alpha}}{K^{*}\{1+n^{-2/5-\xi-\alpha}\}}\right), $$
(A.9)

where \(K^{*}=\max \limits \{16K^{2}{C_{4}^{2}}e{\nu _{m}^{2}}{C_{L}^{4}}\tau _{\min \limits }^{-1}, 4KC_{4}\nu _{m} {C_{L}^{2}}\}\). Let A = {j : 1j ∈ S}. Note that {j : − 1j ∈ S}⊂{j : 1j ∈ S}. Since |A|≤|S|≤ M, the right hand side of Eq. A.5 can be further bounded by

$$ \begin{array}{@{}rcl@{}} &\leq& \sum\limits_{k,j \in A} \sum\limits_{1\leq s,l\leq L}2\exp\left( \frac{-\tau_{\min}n^{1/5-2\xi-2\alpha}}{K^{*}\{1+n^{-2/5-\xi-\alpha}\}}\right) \\ &\leq& 2{\nu_{m}^{2}}{C_{L}^{2}}n^{2\xi+2/5}\exp\left( \frac{-\tau_{\min}n^{1/5-2\xi-2\alpha}}{K^{*}\{1+n^{-2/5-\xi-\alpha}\}}\right). \end{array} $$
(A.10)

The right hand side of the last inequality in Eq. A.10 converges to zero as \(n \rightarrow \infty \). Thus, Eq. A.2 holds. We can also get

$$ P\left( |\lambda_{\max}(\widehat{\boldsymbol{{\varSigma}}}_{S})-\lambda_{\max}(\boldsymbol{{\varSigma}}_{S}) |>\frac{\tau_{\max}}{L}\delta_{1n}\right)\rightarrow 0 , $$
(A.11)

in the same way. Eqs. A.2, A.11, and Assumption 2 imply

$$ P\left( \frac{\tau_{\min}}{L}(1-\delta_{1n}) \leq \lambda_{\min}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \lambda_{\max}(\widehat{\boldsymbol{{\varSigma}}}_{S}) \leq \frac{\tau_{\max}}{L}(1+\delta_{1n})\right)\rightarrow 1, $$
(A.12)

for any S satisfying |S|≤ M. This completes the proof of Lemma A.1. □
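To see the \(L^{-1}\) scaling in Eq. A.1 concretely, here is a small simulation (our sketch, not the paper’s design: a single bounded covariate, uniform Z, and a plain clamped cubic B-spline basis in place of the orthonormalized one):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
n, k = 5000, 3
for L in (5, 10, 20):                                # number of basis functions
    interior = np.linspace(0.0, 1.0, L - k + 1)
    t = np.r_[[0.0] * k, interior, [1.0] * k]        # clamped knot vector, L basis functions
    Z = rng.uniform(0.0, 1.0, n)
    X = rng.uniform(-1.0, 1.0, n)                    # one bounded covariate
    W = X[:, None] * BSpline(t, np.eye(L), k)(Z)     # columns X_i B_s(Z_i)
    lam = np.linalg.eigvalsh(W.T @ W / n)
    print(f"L={L:2d}: L*lam_min={L * lam[0]:.3f}, L*lam_max={L * lam[-1]:.3f}")
```

The products Lλmin and Lλmax stay roughly constant as L grows, which is exactly the sandwich in Eq. A.1.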

Corollary A.1 follows immediately from Lemma A.1.

Corollary A.1.

Suppose that Assumptions 1-3 hold with M = O(nξ) for some 0 < ξ < 1/10. Then, with probability tending to one, we have

$$ \frac{\tau_{\min}}{L}(1-\delta_{1n}) \leq\lambda_{\min}\left( \frac{1}{n}{\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E}\right)\leq\lambda_{\max}\left( \frac{1}{n}{\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E}\right)\leq \frac{\tau_{\max}}{L}(1+\delta_{1n}), $$
(A.13)

for any \(S\subset \mathcal {F}\) and \(E \subset \mathcal {F}\) satisfying |ES|≤ M, where QS is the orthogonal projection defined in Eq. A.20.

Proof of Corollary A.1.

Note that WE∪S = (WE, WS). By (A-74) in Greene (2012), \(({\boldsymbol {W}_{E}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{E})^{-1}\) is the upper left block of \((\boldsymbol {W}_{E\cup S}^{T}\boldsymbol {W}_{E\cup S})^{-1}\). Notice that \(\lambda _{\max \limits }(\boldsymbol {A}^{-1})=\lambda _{\min \limits }(\boldsymbol {A})^{-1}\) and \(\lambda _{\min \limits }(\boldsymbol {A}^{-1})=\lambda _{\max \limits }(\boldsymbol {A})^{-1}\) for any symmetric and invertible matrix A. Hence, we get

$$ \lambda_{\min}(\boldsymbol{W}_{E \cup S}^{T}\boldsymbol{W}_{E \cup S})\leq \lambda_{\min}({\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E}), $$

and

$$ \lambda_{\max}({\boldsymbol{W}_{E}^{T}}\boldsymbol{Q}_{S}\boldsymbol{W}_{E})\leq \lambda_{\max}(\boldsymbol{W}_{E \cup S}^{T}\boldsymbol{W}_{E \cup S}). $$

By Lemma A.1, we have

$$ \frac{n\tau_{\min}}{L}(1-\delta_{1n})\!\leq\! \lambda_{\min}(\boldsymbol{W}_{E\cup S}^{T}\boldsymbol{W}_{E\cup S})\!\leq\! \lambda_{\max}(\boldsymbol{W}_{E\cup S}^{T}\boldsymbol{W}_{E\cup S}) \!\leq\! \frac{n\tau_{\max}}{L}(1+\delta_{1n}) $$
(A.14)

with probability tending to one. This completes the proof of Corollary A.1. □
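The partitioned-inverse fact from Greene (2012) that drives this proof is easy to verify numerically. A minimal sketch with random full-rank design blocks (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, pE, pS = 200, 3, 4
W_E = rng.standard_normal((n, pE))
W_S = rng.standard_normal((n, pS))
W = np.hstack([W_E, W_S])                        # W_{E ∪ S} = (W_E, W_S)

P_S = W_S @ np.linalg.solve(W_S.T @ W_S, W_S.T)  # projection onto col(W_S)
Q_S = np.eye(n) - P_S

lhs = np.linalg.inv(W_E.T @ Q_S @ W_E)
rhs = np.linalg.inv(W.T @ W)[:pE, :pE]           # upper left block of the full inverse
assert np.allclose(lhs, rhs)
print("(W_E^T Q_S W_E)^{-1} equals the upper left block of (W^T W)^{-1}")
```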

Proof of Theorem 1.

We can prove Theorem 1 in almost the same way as in Cheng et al. (2016). However, there are some differences because we also consider structure identification, so we give the proof of Theorem 1 below. First, we can rewrite Eq. 2.1 as

$$ \begin{array}{@{}rcl@{}} Y_{i}&=&\sum\limits_{j=1}^{p}X_{ij}g_{cj}+\sum\limits_{j=1}^{p}X_{ij}g_{vj}(Z_{i})+\epsilon_{i}=\sum\limits_{j\in \mathcal{S}_{c}}X_{ij}g_{cj}+\sum\limits_{j\in \mathcal{S}_{v}}X_{ij}g_{vj}(Z_{i})+\epsilon_{i}\\ &=&\sum\limits_{j\in\mathcal{S}_{c}}X_{ij}B_{1}(Z_{i})\gamma_{1j}+\sum\limits_{j\in \mathcal{S}_{v}}X_{ij}\boldsymbol{B}_{-1}(Z_{i})^{T}\boldsymbol{\gamma}_{-1j}+\delta_{i} +\epsilon_{i}, \end{array} $$
(A.15)

where

$$ \delta_{i}=\sum\limits_{j \in\mathcal{S}_{v}}X_{ij}\{g_{vj}(Z_{i})-\boldsymbol{B}_{-1}(Z_{i})^{T}\boldsymbol{\gamma}_{-1j}\}, $$
(A.16)

as in Honda et al. (2019). If γj is suitably chosen, there exists some positive constant \(C_{g}^{\prime }\) such that

$$ \sum\limits_{j=1}^{p}\|g_{j} -\boldsymbol{B}^{T}\boldsymbol{\gamma}_{j}\|_{\infty} \leq C_{g}^{\prime}L^{-2}, $$
(A.17)

under Assumption 4. Note that |gcj|2 = |γ1j|2L− 1 and \(\|g_{vj}\|^{2}=\|\boldsymbol {\gamma }_{-1j}\|^{2}L^{-1}+O(L^{-4})\). See Appendix A of Honda and Yabe (2017) for properties of the orthonormal basis. Then, there exist some positive constants C1 and C2 such that \(C_{1}L \|g_{vj}\|^{2} \leq \|{\boldsymbol {\gamma }}_{-1j}\|^{2} \leq C_{2}L\|g_{vj}\|^{2}\). Under Assumption 6, we have

$$ |\delta_{i}|\leq C_{X}C_{g}^{\prime}L^{-2}\rightarrow 0. $$
(A.18)

The matrix form of Eq. A.15 is represented by

$$ \boldsymbol{Y}=\sum\limits_{j\in\mathcal{S}_{c}}\boldsymbol{W}_{1j}{\gamma}_{1j}+\sum\limits_{j\in\mathcal{S}_{v}}\boldsymbol{W}_{-1j}\boldsymbol{\gamma}_{-1j}+\boldsymbol{\delta}+\boldsymbol{\epsilon} =\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} + \boldsymbol{\delta} + \boldsymbol{\epsilon}. $$
(A.19)

Let us denote

$$ \boldsymbol{P}_{S}=\boldsymbol{W}_{S}({\boldsymbol{W}_{S}^{T}} \boldsymbol{W}_{S})^{-1} {\boldsymbol{W}_{S}^{T}},\ \ \boldsymbol{Q}_{S}= \boldsymbol{I} -\boldsymbol{P}_{S}, $$
(A.20)
$$ \widetilde{\boldsymbol{W}}_{lS}= \boldsymbol{Q}_{S}\boldsymbol{W}_{l}, \boldsymbol{H}_{lS} =\widetilde{\boldsymbol{W}}_{lS}(\widetilde{\boldsymbol{W}}_{lS}^{T} \widetilde{\boldsymbol{W}}_{lS} )^{-1}\widetilde{\boldsymbol{W}}_{lS}^{T}, $$

for any S satisfying \(|S \cup \mathcal {S}_{0}|\leq M\) and \(\mathcal {S}_{0} \not \subset S\). We can easily show that HlSQS = HlS and \(n \hat {\sigma }^{2}_{S} - n\hat {\sigma }^{2}_{S(l)} =\|\boldsymbol {H}_{lS} \boldsymbol {Y}\|^{2}\), where S(l) = S ∪{l}. From the triangle inequality, we have

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS}\boldsymbol{Y}\|\geq \max_{l \in \mathcal{S}_{0} \cap S^{c}}\| \boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}}\|- \|\boldsymbol{H}_{lS}\boldsymbol{\delta}\| - \| \boldsymbol{H}_{lS}\boldsymbol{\epsilon} \|. $$
(A.21)
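The identity \(n \hat {\sigma }^{2}_{S} - n\hat {\sigma }^{2}_{S(l)} =\|\boldsymbol {H}_{lS} \boldsymbol {Y}\|^{2}\) used above is the standard decomposition of the drop in the residual sum of squares when the candidate block Wl is added to the model. A quick numerical check (our sketch, with random design blocks):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
W_S = rng.standard_normal((n, 5))                # blocks already in the model
W_l = rng.standard_normal((n, 3))                # candidate block (e.g., L - 1 columns)
Y = rng.standard_normal(n)

def rss(W):
    """Residual sum of squares of the least squares fit of Y on W."""
    beta = np.linalg.lstsq(W, Y, rcond=None)[0]
    return np.sum((Y - W @ beta) ** 2)

Q_S = np.eye(n) - W_S @ np.linalg.solve(W_S.T @ W_S, W_S.T)
W_tilde = Q_S @ W_l                              # W_l orthogonalized against W_S
H_lS = W_tilde @ np.linalg.solve(W_tilde.T @ W_tilde, W_tilde.T)

drop = rss(W_S) - rss(np.hstack([W_S, W_l]))     # n*sigma2_S - n*sigma2_{S(l)}
assert np.isclose(drop, np.sum((H_lS @ Y) ** 2))
print(f"RSS drop = ||H_lS Y||^2 = {drop:.4f}")
```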

We evaluate each part in the right hand side of Eq. A.21. Since HlS is a projection matrix, we have

$$ \|\boldsymbol{H}_{lS} \boldsymbol{\delta}\| \leq \|\boldsymbol{\delta} \| \leq \sqrt{n} C_{X} C_{g}^{\prime} L^{-2}. $$
(A.22)

Hence, we have ∥HlSδ∥ = OP(n1/10). Next, we evaluate the third term in Eq. A.21. Under Assumption 5, Proposition 3 of Zhang (2010) holds. Thus, we have

$$ P\left( \frac{\|\boldsymbol{H}_{lS}\boldsymbol{\epsilon}\|^{2}}{m{\sigma^{2}_{1}}} \geq \frac{1+x}{\{1-2/(e^{x/2}\sqrt{1+x}-1)\}^{2}_{+}} \right)\leq e^{-mx/2}(1+x)^{m/2}, $$
(A.23)

for all x > 0 and for any \(l\in \mathcal {F}\cap S^{c}\), where m = rank(HlS). We take \(x=\log n\). If \(l\in \mathcal {F}_{c}\), then rank(HlS) = 1 and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(\log n)\), which implies \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2}=O_{P}(L\log n)\). If \(l \in \mathcal {F}_{v}\), then rank(HlS) = (L − 1) and \(\|{\boldsymbol {H}}_{lS}\boldsymbol {\epsilon }\|^{2} =O_{P}(L\log n)\). Thus, we have

$$ \|\boldsymbol{H}_{lS} \boldsymbol{\epsilon}\|=O_{P}(L^{1/2}(\log n)^{1/2} )=O_{P}(n^{1/10}(\log n)^{1/2}). $$
(A.24)
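To spell out the step from Eq. A.23 to Eq. A.24 (a short supplementary calculation): taking \(x=\log n\) makes the right hand side of Eq. A.23 equal to

$$ e^{-m(\log n)/2}(1+\log n)^{m/2}=\left\{\frac{1+\log n}{n}\right\}^{m/2}\rightarrow 0, $$

while the threshold inside the probability is of order \(m{\sigma ^{2}_{1}}\log n\). Hence \(\|\boldsymbol {H}_{lS}\boldsymbol {\epsilon }\|^{2}=O_{P}(m\log n)\) with m = 1 or m = L − 1, and Eq. A.24 follows since L is of order n1/5.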

Next, we evaluate \(\max \limits _{l \in \mathcal {S}_{0} \cap S^{c}}\| \boldsymbol {H}_{lS}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}}\|\). Notice that

$$ \|\boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \lambda_{\min} ((\widetilde{\boldsymbol{W}}_{lS}^{T} \widetilde{\boldsymbol{W}}_{lS} )^{-1}) \|\widetilde{{\boldsymbol{W}}}_{lS}^{T}{\boldsymbol{W}}_{\mathcal{S}_{0}}{\boldsymbol{\gamma}}_{\mathcal{S}_{0}} \|^{2}. $$
(A.25)

Note that \(\lambda _{\min \limits }((\widetilde {\boldsymbol {W}}_{lS}^{T} \widetilde {\boldsymbol {W}}_{lS})^{-1})=\lambda _{\min \limits }(({\boldsymbol {W}_{l}^{T}}\boldsymbol {Q}_{S}\boldsymbol {W}_{l})^{-1})\) and |{l}∪ S|≤ M. From Corollary A.1, we get

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \frac{L}{n \tau_{\max}(1+\delta_{1n})} \max_{l \in \mathcal{S}_{0}\cap S^{c}}\|\widetilde{{\boldsymbol{W}}}_{lS}^{T}{\boldsymbol{W}}_{\mathcal{S}_{0}}{\boldsymbol{\gamma}}_{\mathcal{S}_{0}} \|^{2}, $$
(A.26)

with probability tending to one. We evaluate \(\max \limits _{l \in \mathcal {S}_{0}\cap S^{c}} \|\widetilde {\boldsymbol {W}}_{lS}^{T}\boldsymbol {W}_{\mathcal {S}_{0}}\boldsymbol {\gamma }_{\mathcal {S}_{0}} \|^{2} \). We can see that

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c} } \|\widetilde{\boldsymbol{W}}_{lS}^{T}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} =\max_{l \in \mathcal{S}_{0} \cap S^{c} } \|{\boldsymbol{W}_{l}^{T}} \boldsymbol{Q}_{S} \boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2}. $$

Notice that \(|g_{cj}|^{2}+\|g_{vj}\|^{2}_{2}=\|g_{j}\|^{2}_{2} \leq \|g_{j}\|_{\infty }^{2}\). From the Cauchy–Schwarz inequality, we get

$$ \begin{array}{@{}rcl@{}} &&\|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \leq \|\boldsymbol{\gamma}_{\mathcal{S}_{0}}\| \| \boldsymbol{W}^{T}_{\mathcal{S}_{0}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|\\ &=&\left( \sum\limits_{j\in\mathcal{S}_{c}}|\gamma_{1j}|^{2} +\sum\limits_{j \in \mathcal{S}_{v}}\| \boldsymbol{\gamma}_{-1j}\|^{2} \right)^{1/2} \left( \sum\limits_{j\in \mathcal{S}_{0}\cap S^{c} } \|{\boldsymbol{W}^{T}_{j}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \right)^{1/2}\\ &\leq& \left( L\sum\limits_{j\in\mathcal{S}_{c}} |g_{cj}|^{2} +C_{2}L\sum\limits_{j \in \mathcal{S}_{v}} \|g_{vj}\|_{2}^{2} \right)^{1/2} \left( \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|{\boldsymbol{W}^{T}_{l}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} 2p_{0} \right)^{1/2}\\ &\leq& 2^{1/2}p_{0}^{1/2} L^{1/2}C_{g}(1+C_{2})^{1/2} \left( \max_{l \in \mathcal{S}_{0} \cap S^{c}} \|{\boldsymbol{W}^{T}_{l}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \right)^{1/2}, \end{array} $$
(A.27)

with probability tending to one. Thus, we get

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|{\boldsymbol{W}^{T}_{l}} \boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \frac{\|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{4}}{2p_{0}L{C_{g}^{2}}(1+C_{2})}. $$
(A.28)

Since \(|\mathcal {S}_{0}\cup S|\leq M\), we have

$$ \|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \lambda_{\min}\left( \boldsymbol{W}_{\mathcal{S}_{0}}^{T}\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}}\right) \|\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|^{2} \geq \frac{n\tau_{\min}}{L}(1-\delta_{1n})\|\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|^{2}, $$
(A.29)

by Corollary A.1. Notice that

$$ \begin{array}{@{}rcl@{}} &&\|\boldsymbol{Q}_{S}\boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{4}\geq \frac{n^{2}\tau_{\min}^{2}}{L^{2}}(1-\delta_{1n})^{2} \|\boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{4}\\ &&\geq\frac{n^{2}\tau_{\min}^{2}}{L^{2}}(1-\delta_{1n})^{2}\left( L\max_{j \in \mathcal{S}_{c}}|g_{cj}|^{2}+LC_{1}\max_{j \in \mathcal{S}_{v}}\|g_{vj}\|^{2}_{2}\right)^{2}\\ &&\geq {n^{2}\tau_{\min}^{2}}(1-\delta_{1n})^{2}C_{\min}^{2}(1+C_{1})^{2}, \end{array} $$
(A.30)

with probability tending to one. Note that \(\min \limits \{\max \limits _{j \in \mathcal {S}_{c}}|g_{cj}|^{2}, \max \limits _{j \in \mathcal {S}_{v}}\|g_{vj}\) \(\|_{2}^{2}\}>C_{\min \limits }\) under Assumption 4. From Eqs. A.26A.28, and A.30, we get

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS} \boldsymbol{W}_{\mathcal{S}_{0}} \boldsymbol{\gamma}_{\mathcal{S}_{0}} \|^{2} \geq \frac{n(1-\delta_{1n})^{2} }{(1+\delta_{1n})} \frac{\tau_{\min}^{2}C_{\min}^{2} (1+C_{1})^{2}}{\tau_{\max} {C_{g}^{2}}(1+C_{2})2p_{0}}, $$
(A.31)

with probability tending to one. Equation A.31 implies that the first term in the right hand side of Eq. A.21 dominates the other two terms. Thus, we have

$$ \max_{l \in \mathcal{S}_{0} \cap S^{c}}\|\boldsymbol{H}_{lS}\boldsymbol{Y}\|^{2}\geq 4^{-1}\max_{l \in \mathcal{S}_{0} \cap S^{c}}\| \boldsymbol{H}_{lS}\boldsymbol{W}_{\mathcal{S}_{0}}\boldsymbol{\gamma}_{\mathcal{S}_{0}}\|^{2}, $$
(A.32)

with probability tending to one. By taking {δ2n} as \(\delta _{2n} = 1-(1-\delta _{1n})^{2}(1+\delta _{1n})^{-1}\), we get

$$ \max_{l\in \mathcal{F}\cap S^{c}}\{ n \hat{\sigma}_{S}^{2} -n \hat{\sigma}^{2}_{S\cup\{l\}}\}\geq n(1-\delta_{2n})\frac{\tau_{\min}^{2} C_{\min}^{2}(1+C_{1})^{2}}{4\tau_{\max}{C_{g}^{2}}(1+C_{2})2p_{0}} =n(1-\delta_{2n})D^{*}, $$
(A.33)

for any S satisfying \(|S\cup \mathcal {S}_{0}|\leq M\) with probability tending to one. This completes the proof of Theorem 1. □

Proof of Theorem 2.

Suppose that the conclusion of Theorem 2 is false. Then, we have \(\mathcal {S}_{v} \not \subset F_{k}\) or \(\mathcal {S}_{c} \not \subset E_{k}\) for k = 1,⋯ ,T∗. This implies that \(\mathcal {S}_{0} \not \subset S_{k}\) and \(|S_{k} \cup \mathcal {S}_{0}|\leq M\) for k = 1,⋯ ,T∗. Note that we have \(n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}} \geq \hat {\sigma }_{S_{k}(l)}^{2}\) and \(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{l\}} \leq n^{-1}{\sum }_{i=1}^{n}{Y_{i}^{2}}\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\), and \(\log (1+x)\geq x(1+x)^{-1}\) for all x > − 1. Hence, we have

$$ \begin{array}{@{}rcl@{}} &&\text{EBIC}(S_{k}) -\text{EBIC}(S_{k}\cup\{l\}) =n\log \left( \frac{\hat\sigma^{2}_{S_{k}}}{\hat\sigma^{2}_{S_{k}\cup\{l\}}} \right) -(\log n +2\eta \log p)\\ &&\geq n\log \left( 1+ \frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{l\}}}{\frac{1}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} \right) -(\log n +2\eta \log p)\\ &&\geq \frac{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}}{\frac{2}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} -(\log n +2\eta \log p), \end{array} $$
(A.34)

for any \(l \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Hence, if \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\), we obtain

$$ \text{EBIC}(S_{k}) -\text{EBIC}(S_{k}\cup\{j^{*}\}) \!\geq\! \frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}} - n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\frac{2}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} -(\log n +2\eta \log p), $$
(A.35)

since \(j^{*}=\arg \min \limits _{l \in \mathcal {F}\cap {S_{k}^{c}}} \hat {\sigma }^{2}_{S_{k}\cup \{l\}}\). If \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\), we can also show that

$$ \text{EBIC}(S_{k}) -\text{EBIC}(S_{k}\cup\{j^{*}\})\geq \frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\frac{2}{n}\sum\limits_{i=1}^{n}{Y_{i}^{2}}} -(L-1)(\log n +2\eta \log p), $$
(A.36)

in the same way. The second terms of the right hand side in Eqs. A.35 and A.36 are o(n) under Assumption 1. In addition, we have \(\frac {1}{n}{\sum }_{i=1}^{n}{Y_{i}^{2}} \xrightarrow {P} E[Y^{2}]\). Therefore, Theorem 1 implies that the first terms of the right hand side in Eqs. A.35 and A.36 dominate the second terms, and we get

$$ \text{EBIC}(S_{k}) \geq \text{EBIC}(S_{k}\cup\{j^{*}\}) \ \ \text{for } k =1,\cdots, T^{*}, $$
(A.37)

with probability tending to one. Equation A.37 implies that our forward selection procedure selects some index at the kth step for each k = 1,⋯ ,T∗ + 1. Thus, the sequence \(\{S_{k} \}_{k=1}^{T^{*}+1}\) is monotonically increasing and the sequence \(\{\hat \sigma ^{2}_{S_{k}}\}_{k=1}^{T^{*}+1}\) is nonincreasing. Note that

$$ \frac{1}{n}\sum\limits_{i=1}^{n} (Y_{i} -\bar{Y})^{2} \geq \hat\sigma^{2}_{S_{1}} -\hat\sigma^{2}_{S_{T^{*}+1}} =\sum\limits_{k=1}^{T^{*}}(\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k+1}}) =\sum\limits_{k=1}^{T^{*}} \max_{l \in \mathcal{F}\cap {S^{c}_{k}}}\{ \hat\sigma^{2}_{S_{k}} -\hat\sigma^{2}_{S_{k}\cup\{l\}}\}, $$
(A.38)

where \(\bar {Y}=n^{-1}{\sum }_{i=1}^{n}Y_{i}\). Theorem 1 implies \(\text {Var}(Y)\geq {\sum }_{k=1}^{T^{*}}(1-\delta _{2n})D^{*} = T^{*}(1-\delta _{2n})D^{*}\) with probability tending to one. Thus, we have

$$ T^{*}\leq\frac{\text{Var}(Y)}{(1-\delta_{2n})D^{*}} <\frac{\text{Var}(Y)}{(1-2\delta_{2n})D^{*}}\leq T^{*}, $$
(A.39)

with probability tending to one, and a contradiction occurs. This completes the proof. □
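Before turning to Theorem 3, the selection rule analyzed in these proofs can be summarized in code. The following is a minimal sketch of one forward step (names such as blocks and eta are ours, and the sketch compresses the paper’s constant/varying classification into a single column block per candidate):

```python
import numpy as np

def ebic(rss_val, n, df, p, eta):
    """EBIC(S) = n log(RSS/n) + df * (log n + 2 * eta * log p)."""
    return n * np.log(rss_val / n) + df * (np.log(n) + 2.0 * eta * np.log(p))

def rss(W, Y):
    """Residual sum of squares of regressing Y on the columns of W."""
    if W.shape[1] == 0:
        return float(np.sum(Y ** 2))
    beta = np.linalg.lstsq(W, Y, rcond=None)[0]
    return float(np.sum((Y - W @ beta) ** 2))

def forward_step(Y, blocks, selected, p, eta):
    """One forward step: add the candidate block that minimizes the residual
    variance, but keep it only if the EBIC decreases; return None to stop.
    blocks maps a candidate index to its (n, d_j) column block."""
    n = len(Y)
    W_sel = (np.hstack([blocks[j] for j in selected])
             if selected else np.empty((n, 0)))
    df_sel = W_sel.shape[1]
    current = ebic(rss(W_sel, Y), n, df_sel, p, eta)
    candidates = [j for j in blocks if j not in selected]
    best = min(candidates, key=lambda j: rss(np.hstack([W_sel, blocks[j]]), Y))
    proposed = ebic(rss(np.hstack([W_sel, blocks[best]]), Y), n,
                    df_sel + blocks[best].shape[1], p, eta)
    return best if proposed < current else None
```

In the notation of the proofs, a candidate in \(\mathcal {F}_{c}\) contributes one column, so its penalty is \(\log n +2\eta \log p\), while a candidate in \(\mathcal {F}_{v}\) contributes L − 1 columns, giving the penalty \((L-1)(\log n +2\eta \log p)\) in Eqs. A.35 and A.36.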

Proof of Theorem 3.

Since \(\mathcal {S}_{0} \subset S_{k}\), we have

$$ \hat{\sigma}_{S_{k}}^{2}=n^{-1}\|\boldsymbol{Q}_{S_{k}}\boldsymbol{\delta}\|^{2}+2n^{-1}\boldsymbol{\delta}^{T}\boldsymbol{Q}_{S_{k}}\boldsymbol{\epsilon}+n^{-1}\|\boldsymbol{Q}_{S_{k}}\boldsymbol{\epsilon}\|^{2}. $$

We can see that the first term and the second term of the right hand side are oP(1) by Eq. A.22 and the Cauchy–Schwarz inequality. Note that \(\|\boldsymbol {Q}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=\|\boldsymbol {\epsilon }\|^{2}-\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}\), and \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}=O_{P}(m\log n)\) by Proposition 3 of Zhang (2010), where \(m=\text {rank}(\boldsymbol {P}_{S_{k}})= |\mathcal {F}_{c}\cap S_{k}|+|\mathcal {F}_{v}\cap S_{k}|(L-1)\). Notice that |Sk|≤ 2k ≤ 2M, and M ≤ νmnξ for some νm > 0, since M = O(nξ). Thus, we have \(n^{-1}m\log n \leq n^{-1}2ML\log n \leq 2C_{L}\nu _{m} n^{\xi +1/5-1}\log n \rightarrow 0\) as \(n\rightarrow \infty ,\) and we get \(\|\boldsymbol {P}_{S_{k}}\boldsymbol {\epsilon }\|^{2}/n=o_{P}(1)\). Since \(\|\boldsymbol {\epsilon }\|^{2}/n\xrightarrow {P}\sigma ^{2}\), we obtain \(\hat {\sigma }_{S_{k}}^{2}=\sigma ^{2}+o_{P}(1)\). We can also show that \(\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=\sigma ^{2}+o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\) in the same way. Hence, we have \(\hat {\sigma }_{S_{k}}^{2}-\hat {\sigma }_{S_{k}\cup \{l\}}^{2}=o_{P}(1)\) for any \(l \in \mathcal {F}\cap {S_{k}^{c}}\). Now, let \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Since \(-1<-(\hat \sigma ^{2}_{S_{k}}-\hat \sigma ^{2}_{S_{k}\cup \{j^{*}\}})(\hat \sigma ^{2}_{S_{k}})^{-1}\) and \(\log (1+x)\geq x(1+x)^{-1}\) for all x > − 1, we get

$$ \begin{array}{@{}rcl@{}} &&\text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k})\\ &&=n \log\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right) +(\log n + 2 \eta \log p)\\ &&\geq \left( -\frac{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)^{-1}+(\log n + 2 \eta \log p)\\ &&=\left( -\frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\hat\sigma^{2}_{S_{k}}}\right)\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)^{-1}+(\log n + 2 \eta \log p). \end{array} $$
(A.40)

Note that \(\|\boldsymbol {H}_{lS_{k}}\boldsymbol {Y}\|^{2} \leq 2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\delta }\|^{2} +2\|\boldsymbol {H}_{lS_{k}}\boldsymbol {\epsilon }\|^{2}\). From Eq. A.22 and Proposition 3 of Zhang (2010), we have

$$ \max_{l \in \mathcal{F}\cap {S_{k}^{c}}} \{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}} \} = O_{P}(n^{1/5}\log n). $$
(A.41)

The first term of the right hand side in the last equality of Eq. A.40 is \(O_{P}(n^{1/5}\log n)\) by Eq. A.41, because its denominator converges to one in probability. On the other hand, the second term in the same equality is of order \(n^{\xi _{0}}\), and \(O_{P}(n^{1/5-\xi _{0}}\log n)=o_{P}(1)\). Hence, we have

$$ \begin{array}{@{}rcl@{}} \text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k}) &\geq& (\log n +2\eta \log p)\{1- O_{P}(n^{1/5-\xi_{0}}\log n)\}\\ &=& (\log n +2\eta \log p )\{1-o_{P}(1)\}, \end{array} $$
(A.42)

when \(j^{*} \in \mathcal {F}_{c}\cap {S_{k}^{c}}\). Next, let \(j^{*} \in \mathcal {F}_{v} \cap {S_{k}^{c}}\). By the same argument, we get

$$ \begin{array}{@{}rcl@{}} &&\text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k}) \\ &&\geq \left( -\frac{\max_{l \in \mathcal{F}\cap {S_{k}^{c}}}\{n\hat\sigma^{2}_{S_{k}}-n\hat\sigma^{2}_{S_{k}\cup\{l\}}\}}{\hat\sigma^{2}_{S_{k}}}\right)\left( 1-\frac{\hat\sigma^{2}_{S_{k}}-\hat\sigma^{2}_{S_{k}\cup\{j^{*}\}}}{\hat\sigma^{2}_{S_{k}}}\right)^{-1}\\ &&\quad +(L-1)(\log n + 2 \eta \log p) \\ &&=(L-1)(\log n +2\eta \log p)\{1-O_{P}(n^{-\xi_{0}}\log n)\}\\ &&=(L-1)(\log n +2\eta \log p)\{1-o_{P}(1)\} . \end{array} $$
(A.43)

Eqs. A.42 and A.43 imply

$$ \text{EBIC}(S_{k}\cup\{j^{*}\})-\text{EBIC}(S_{k})>0 \ \ \text{for } j^{*}\in \mathcal{F}\cap {S_{k}^{c}}, $$
(A.44)

with probability tending to one. Eq. A.44 implies that our forward selection procedure stops variable selection at the kth step. This completes the proof of Theorem 3. □

About this article

Cite this article

Shinkyu, A. Forward Selection for Feature Screening and Structure Identification in Varying Coefficient Models. Sankhya A (2021). https://doi.org/10.1007/s13171-021-00261-4

Keywords

  • Varying coefficient model
  • B-spline
  • Screening consistency
  • Structure identification
  • BIC
  • Forward selection

AMS (2000) subject classification

  • 62G08