Abstract
Variable selection and dimension reduction have been considered in nonparametric regression for improving the precision of estimation, via the formulation of a semiparametric multiple index model. However, most existing methods are ill-equipped to cope with a high-dimensional setting where the number of variables may grow exponentially fast with sample size. We propose a new procedure for simultaneous variable selection and dimension reduction in high-dimensional nonparametric regression problems. It consists essentially of penalised local polynomial regression, with the bandwidth matrix regularised to facilitate variable selection, dimension reduction and optimal estimation at the oracle convergence rate, all in one go. Unlike most existing methods, the proposed procedure does not require explicit bandwidth selection or an additional step of dimension determination using techniques like cross-validation or principal components. Empirical performance of the procedure is illustrated with both simulated and real data examples.
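The abstract's central device is a local polynomial fit with one bandwidth per coordinate, where driving a bandwidth to a very large value effectively removes that variable from the fit. A minimal sketch of that idea follows; the function `local_linear` and the toy data are illustrative assumptions, not the paper's penalised estimator, which regularises the whole bandwidth matrix jointly.

```python
import numpy as np

def local_linear(x0, X, y, h):
    """Local linear estimate of m(x0) with a product Gaussian kernel and
    per-coordinate bandwidths h; a huge h[d] makes the kernel weights
    (and hence the fit) essentially ignore coordinate d."""
    U = (X - x0) / h                                   # scaled differences
    w = np.exp(-0.5 * np.sum(U * U, axis=1))           # product-kernel weights
    Z = np.hstack([np.ones((X.shape[0], 1)), X - x0])  # local design matrix
    WZ = Z * w[:, None]
    beta, *_ = np.linalg.lstsq(WZ.T @ Z, WZ.T @ y, rcond=None)
    return beta[0]                                     # intercept = estimate of m(x0)

# y depends on X[:, 0] only; an enormous bandwidth on the second
# coordinate effectively deselects it.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = np.sin(X[:, 0])                                    # noiseless, for clarity
est = local_linear(np.array([0.5, 0.0]), X, y, np.array([0.3, 1e6]))
```

With the second bandwidth set to `1e6`, `est` is close to `sin(0.5)`, as a one-dimensional local linear fit on the first coordinate would give.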
References
Allen, G.I.: Automatic feature selection via weighted kernels and regularization. J. Comput. Graph. Stat. 22, 284–299 (2013)
Chen, X., Zou, C., Cook, R.D.: Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Stat. 38, 3696–3723 (2010)
Chen, J., Zhang, C., Kosorok, M.R., Liu, Y.: Double sparsity kernel learning with automatic variable selection and data extraction. Stat. Interface 11, 401–420 (2018)
Conn, D., Li, G.: An oracle property of the Nadaraya–Watson kernel estimator for high dimensional nonparametric regression. Scand. J. Stat. 46, 735–764 (2019)
Cook, R.D., Li, B.: Dimension reduction for conditional mean in regression. Ann. Stat. 30, 455–474 (2002)
Cook, R.D., Weisberg, S.: Comment on “Sliced inverse regression for dimension reduction”. J. Am. Stat. Assoc. 86, 328–332 (1991)
Fan, J., Gijbels, I.: Local Polynomial Modelling and Its Applications. Chapman and Hall, London (1996)
Giordano, F., Lahiri, S.N., Parrella, M.L.: GRID: A variable selection and structure discovery method for high dimensional nonparametric regression. Ann. Stat. 48, 1848–1874 (2020)
Jiang, R., Qian, W.M., Zhou, Z.G.: Single-index composite quantile regression with heteroscedasticity and general error distributions. Stat. Pap. 57, 185–203 (2016)
Lafferty, J., Wasserman, L.: Rodeo: sparse, greedy nonparametric regression. Ann. Stat. 36, 28–63 (2008)
Li, K.C.: Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 86, 316–327 (1991)
Li, K.C.: On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. J. Am. Stat. Assoc. 87, 1025–1039 (1992)
Li, L.: Sparse sufficient dimension reduction. Biometrika 94, 603–613 (2007)
Li, B., Dong, Y.: Dimension reduction for nonelliptically distributed predictors. Ann. Stat. 37, 1272–1298 (2009)
Li, L., Cook, R.D., Nachtsheim, C.J.: Model-free variable selection. J. R. Stat. Soc. Ser. B 67, 285–299 (2005)
Rekabdarkolaee, H.M., Wang, Q.: Variable selection through adaptive MAVE. Stat. Probab. Lett. 128, 44–51 (2017)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer, New York (1996)
Wang, Q., Yin, X.: A nonlinear multi-dimensional variable selection method for high dimensional data: sparse MAVE. Comput. Stat. Data Anal. 52, 4512–4520 (2008)
Wang, T., Xu, P., Zhu, L.: Penalized minimum average variance estimation. Stat. Sin. 23, 543–569 (2013)
White, K.R., Stefanski, L.A., Wu, Y.: Variable selection in kernel regression using measurement error selection likelihoods. J. Am. Stat. Assoc. 112, 1587–1597 (2017)
Wu, W., Hilafu, H., Xue, Y.: Simultaneous estimation for semi-parametric multi-index models. J. Stat. Comput. Simul. 89, 2354–2372 (2019)
Xia, Y.: Asymptotic distributions for two estimators of the single-index model. Econometric Theory 22, 1112–1137 (2006)
Xia, Y., Tong, H., Li, W.K., Zhu, L.X.: An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B 64, 363–410 (2002)
Yu, P., Du, J., Zhang, Z.: Single-index partially functional linear regression model. Stat. Pap. 61, 1107–1123 (2020)
Zhang, J.: Estimation and variable selection for partial linear single-index distortion measurement errors models. Stat. Pap. 62, 887–913 (2021)
Zhao, W., Zhang, F., Li, R., Lian, H.: Principal single-index varying-coefficient models for dimension reduction in quantile regression. J. Stat. Comput. Simul. 90, 800–818 (2020)
Zhao, W., Li, R., Lian, H.: High-dimensional quantile varying-coefficient models with dimension reduction. Metrika 85, 1–19 (2022)
Author information
Contributions
Lee designed and directed the research. Cheung wrote the main manuscript and performed the numerical studies. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
I Lemmas and proofs
Consider a fixed subset \(\mathcal {B}\subset \{1,\ldots ,D\}\) with \(|\mathcal {B}|\) bounded, and a bandwidth vector \(\pmb {h}=(h_1,\ldots ,h_D)^\top \in (0,\infty ]^D\) satisfying \(\bar{h}\equiv \max _{d\in \mathcal {B}}h_d=o(1)\) and \(h_d^{-1}=O(1)\) for \(d\notin \mathcal {B}\). Define, for \(\mathcal {T}\in \mathbb {R}^{r\times D}\) and \(i=1,\ldots ,n\), \(w_{i,\mathcal {T}}=Y_i-m_{\mathcal {T}}(\mathcal {T}\pmb {X}_i)\). For \(\pmb {x}\in \mathbb {R}^D\), \(T\in \mathscr {T}^\circ \), \(\pmb {\gamma }\in \mathbb {Z}_+^D\), \(r\ge 1\) and any function g on \(\mathbb {R}^D\), define
We state four technical lemmas before presenting our main proofs.
Lemma 1 establishes asymptotic expansions for the mean and variance of a general kernel-weighted sample average commonly found in local polynomial regression, with index-specific bandwidths set to be \(\pmb {h}\).
Lemma 1
Let \(\pmb {\gamma }\in \pmb {\Gamma }^{2p+1}\), \(T\in \mathscr {T}^\circ \) and \(\pmb {x}\in \mathbb {R}^D\) be fixed. Let g be a \((\Delta (\Vert \pmb {\gamma }_{\mathcal {B}}\Vert _1)+1)\) times differentiable function on \(\mathbb {R}^D\), with \(\mathbb {E}\left[ \Vert T\pmb {X}\Vert _2^{2p+1}|g(T\pmb {X})|\right] <\infty \). Let \(W_1,\ldots ,W_n\) be independent random variables such that \(\mathbb {E}[W_i^r|T\pmb {X}_1,\ldots ,T\pmb {X}_n]=\omega _r\), for \(r=1,2\) and \(i=1,\ldots ,n\). Then we have
and
Proof of Lemma 1
Noting that
(S.1) follows by Taylor expanding \(f_Tg\) in powers of \(\pmb {h}_{\mathcal {B}}\), that is
term-by-term integration and (A2). The result (S.2) follows by noting that
and that \(n^{-1}\omega _1^2=O\left( n^{-1}\pmb {h}_\mathcal {B}^{-\varvec{1}_{\mathcal {B}}}\bar{h}\right) \). \(\square \)
In particular, we deduce by setting \(g(\cdot )\equiv 1\) and \(W_i\equiv 1\) for all i in Lemma 1 that
Since \(\mathcal {K}_{\mathcal {B}}^1(\pmb {x};f_T,\pmb {\gamma },T)\) does not depend on \(\pmb {\gamma }\) for \(\pmb {\gamma }\in \pmb {\Gamma }^p_{\mathcal {B}}\), we can construct, by (A1), an inverse \(\{\breve{\kappa }_{\pmb {\gamma },\pmb {\psi }}(\pmb {x};T):\pmb {\gamma },\pmb {\psi }\in \pmb {\Gamma }^p_{\mathcal {B}}\}\) such that
Setting \(\pmb {Q}=\textrm{diag}(\pmb {h})^{-1}T\), the local coefficient estimators \({\hat{\beta _{\pmb {\gamma }}}}^{-\emptyset }(\pmb {x};\pmb {Q})\) defined in (1) solve the following system of equations for \(\beta _{\pmb {\gamma }}\):
where \(e(a)=\mathbb {I}\{a>0\}-\mathbb {I}\{a<0\}\) for \(a\ne 0\) and \(|e(0)|\le 1\). For \(\pmb {\gamma }\in \pmb {\Gamma }^p\setminus \pmb {\Gamma }^p_{\mathcal {B}}\), \({\hat{\beta _{\pmb {\gamma }}}}^{-\emptyset }(\pmb {x};\pmb {Q})\) is associated with at least one index which has a large bandwidth \(h_d\) with \(h_d^{-1}=O(1)\), and is therefore heavily penalised. The following lemma states the order of such estimators.
Lemma 2
For any \(\pmb {x}\in \mathbb {R}^D\) and \(T=\textrm{diag}(\pmb {h})\pmb {Q}\in \mathscr {T}^\circ \), we have
Proof of Lemma 2
Let \(M_\beta (\pmb {x})=\max \big \{|{\hat{\beta }}^{-\emptyset }_{\pmb {\gamma }}(\pmb {x};\pmb {Q})|:\pmb {\gamma }\in \pmb {\Gamma }^p\setminus \pmb {\Gamma }^p_{\mathcal {B}}\big \}\). For \(\alpha =1\), it follows by minimality of \(\big \{{\hat{\beta }}^{-\emptyset }_{\pmb {\gamma }}(\pmb {x};\pmb {Q}):\pmb {\gamma }\in \pmb {\Gamma }^p\big \}\) that
Thus, (S.4) holds for \(\alpha =1\). For \(\alpha >1\), we have by (S.3) that
For \(\alpha \in (1,2]\), (S.5) reduces to \(M_\beta (\pmb {x})^{\alpha -1} =O_p(n^{-1}D^{-p}+n^{-1}M_\beta (\pmb {x}))\), so that \(M_\beta (\pmb {x})=O_p(n^{-1}D^{-p})\). For \(\alpha >2\), we have by (S.5) either \(M_\beta (\pmb {x})=O_p (n^{-1}D^{-p})\) or \(M_\beta (\pmb {x}) =O_p(n^{-(\alpha -1)/(\alpha -2)}D^{-p})\). It follows that (S.4) also holds for any \(\alpha >1\). \(\square \)
Define, for \(i=1,\ldots ,n\), \(\pmb {\gamma }\in \pmb {\Gamma }^p_{\mathcal {B}}\), \(\pmb {\psi }\in \pmb {\Gamma }^{p+1}_{\mathcal {B}}\) and \(\pmb {x},\pmb {y}\in \mathbb {R}^D\),
The next lemma establishes asymptotic expansions for the regression function estimator \({\hat{\beta _{\varvec{0}}}}^{-\emptyset }(\pmb {x};\pmb {Q})\) obtained by setting the index-specific bandwidths and the direction matrix to be \(\pmb {h}\) and T, respectively.
Lemma 3
For any \(\pmb {x}\in \mathbb {R}^D\) and \(T=\textrm{diag}(\pmb {h})\pmb {Q}\in \mathscr {T}^\circ \), we have
Suppose further that \(|\mathcal {B}|\ge r_0\). Then we have
with
Proof of Lemma 3
Noting \(\sum _{\pmb {\psi }\in (\pmb {\Gamma }^p\backslash \pmb {\Gamma }^p_{\mathcal {B}})}\pmb {h}_\mathcal {B}^{\pmb {\gamma }+\pmb {\psi }_{\mathcal {B}}}\kappa _{\pmb {\psi }}(\pmb {x};T)\beta _{\pmb {\psi }}=O_p(n^{-1})\) and
for any \(\pmb {\gamma }\in \pmb {\Gamma }^p_{\mathcal {B}}\), by (S.3), we have
Applying Lemma 1, we have \(\sum _{i=1}^n\nu _{\varvec{0},i}(\pmb {x};\pmb {Q})w_{i,T_{\mathcal {B}\cdot }}=\Omega _p\big (n^{-1/2}\pmb {h}_{\mathcal {B}}^{-(1/2)\varvec{1}_{\mathcal {B}}}\big )\) and
so that the first expression for \({\hat{\beta _{\varvec{0}}}}^{-\emptyset }(\pmb {x};\pmb {Q})\) follows from (S.7).
Consider next the case \(|\mathcal {B}|\ge r_0\). The expansion (S.6) holds by substituting
in (S.7). That \(R^*_{\varvec{0}}(\pmb {x};\pmb {Q})=\Omega _p(\bar{h}^{p^*}+n^{-1/2}\bar{h}^{p+1}\pmb {h}_\mathcal {B}^{-(1/2)\varvec{1}_{\mathcal {B}}})\) follows from the fact that \(\mathbb {E}\big [R^*_{\varvec{0}}(\pmb {x};\pmb {Q})\big ]=\Omega _p(\bar{h}^{p^*})\) and \(\textrm{Var}(R^*_{\varvec{0}}(\pmb {x};\pmb {Q}))=\Omega _p(n^{-1/2}\bar{h}^{p+1}\pmb {h}_\mathcal {B}^{-(1/2)\varvec{1}_{\mathcal {B}}})\). And \(\sum _{i=1}^n\nu _{\varvec{0},i}(\pmb {x};\pmb {Q})w_{i,T_\mathcal {B\cdot }}=\Omega _p(n^{-1/2}\pmb {h}_{\mathcal {B}}^{-(1/2)\varvec{1}_{\mathcal {B}}})\) follows from the fact that expectation and variance of \(\sum _{i=1}^n\nu _{\varvec{0},i}(\pmb {x};\pmb {Q})w_{i,T_\mathcal {B\cdot }}\) equal to zero and \(\Omega _p(n^{-1}\pmb {h}_{\mathcal {B}}^{-\varvec{1}_{\mathcal {B}}})\) respectively by Lemma 1. \(\square \)
Define, for \(j\notin S_k\) and \(\pmb {\gamma }\in \pmb {\Gamma }_{\mathcal {B}}^p\), \(R^{*-S_k}_{\pmb {\gamma }}\) and \(\nu ^{-S_k}_{\pmb {\gamma },j}\) to be the counterparts of \(R^*_{\pmb {\gamma }}\) and \(\nu _{\pmb {\gamma },j}\), respectively, evaluated on the sample obtained by removing the observations \(\big \{(\pmb {X}_i,Y_i): i \in S_k\big \}\). The next lemma establishes an expansion for the K-fold cross-validated squared prediction error.
Lemma 4
Suppose that \(|\mathcal {B}|\ge r_0\) and (A5) holds. Let \(\tilde{\mathcal {C}}\supset \mathcal {A}\) be a fixed subset in \(\{1,\ldots ,D\}\) with \(|\tilde{\mathcal {C}}|\) bounded. Then we have
uniformly over \(T=\textrm{diag}(\pmb {h})\pmb {Q}\in \mathscr {T}^\circ \) satisfying \(\Vert T_{\mathcal {B},\{d\}}\Vert _2=\Omega _p(\mathbb {I}\{d\in \tilde{\mathcal {C}}\})\), \(d=1,\ldots ,D\), and \(d_{\mathcal {B}}(T)\) sufficiently small, where
Moreover, we have in general
provided that \(m_{T^*}\) is not functionally related to the distribution function of \((\pmb {X},Y)\).
Proof of Lemma 4
Using Lemma 3, the cross-validated squared error has the expansion
To prove (S.8), it suffices to show that \(n_0^{-1}\sum _{i\in S_1}\mathscr {R}_{i,1}(T)^2=\Omega _p\big (\bar{h}^{2p^*}+n^{-1}\pmb {h}_{\mathcal {B}}^{-\varvec{1}_{\mathcal {B}}}+d_{\mathcal {B}}(T)^2\big )\). Note that for \(T,T'\in \mathscr {T}^\circ \) with \(T_{\mathcal {B}\cdot }\ne T'_{\mathcal {B}\cdot }\), \(|\mathscr {R}_{i,1}(T)-\mathscr {R}_{i,1}(T')|/\Vert T_{\mathcal {B}\cdot }-T'_{\mathcal {B}\cdot }\Vert _1\) has uniformly bounded moments. Applying Example 19.7 in, we have, for large n and some sufficiently small constant \(k_0\),
uniformly over \(T\in \mathscr {T}^\circ \). Noting that
we have in general that
Similarly, the tail bound \(\mathbb {P}\big (|n_0^{-1}\sum _{i\in S_1}\mathscr {R}_{i,1}(T)w_{i,T^*}|\ge t\big )\le k_0^{-1}e^{-k_0nt^2/\mathbb {E}[\mathscr {R}_{i,1}(T)^2]}\) implies that
uniformly over \(T\in \mathscr {T}^\circ \). The lemma then follows from (S.9), (S.10) and (S.11). \(\square \)
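Lemma 4 concerns the K-fold cross-validated squared prediction error: the data are split into folds \(S_1,\ldots ,S_K\), the estimator is refitted with each fold held out, and the held-out squared residuals are averaged. A generic sketch of that quantity (the `fit` interface and the trivial mean predictor below are hypothetical, not the paper's estimator):

```python
import numpy as np

def kfold_cv_error(X, y, fit, K=5, seed=0):
    """K-fold cross-validated squared prediction error:
    n^{-1} * sum_k sum_{i in S_k} (y_i - mhat^{-S_k}(x_i))^2,
    where mhat^{-S_k} is fitted without the observations in fold S_k."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    total = 0.0
    for S_k in folds:
        train = np.setdiff1d(idx, S_k)
        predict = fit(X[train], y[train])   # fit returns a predictor x -> mhat(x)
        preds = np.array([predict(x) for x in X[S_k]])
        total += np.sum((y[S_k] - preds) ** 2)
    return total / len(y)

# A trivial "estimator" predicting the training mean; on constant data
# the cross-validated error is zero.
mean_fit = lambda Xtr, ytr: (lambda x: ytr.mean())
cv_err = kfold_cv_error(np.zeros((20, 2)), np.ones(20), mean_fit)
```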
II Proof of Theorem 1
It follows from (A5) that
Setting \(\pmb {A}=n^{1/(2p^*+r_0)}T^*\), it follows by (S.12) and Lemma 4 that
Define \(r^\dag =\big |\{d:{\hat{h}}_d=o_p(1)\}\big |\) and \( \mathcal {C}=\{d:\Vert \hat{T}_{\{1,\dots ,r^\dag \},\{d\}}\Vert _2=\Omega _p(1)\}\). Using Lemma 3 and comparing the objective values in (2) with \((\mathcal {B},\pmb {Q})\) set to be \(\big (\{1,\ldots ,r_0\},\pmb {A}\big )\) and \(\big (\{1,\ldots ,r^\dag \},\hat{\pmb {Q}}\big )\), respectively, we have
It follows that \(m_{\hat{T}_{\{1,\dots ,r^\dag \}\cdot }}(\hat{T}_{\{1,\dots ,r^\dag \}\cdot }\,\pmb {X})=m_{T^*}(T^*\pmb {X})+o(1)\) almost surely. This happens only if \(r^\dag \ge r_0\) and \(\mathcal {C}\supset \mathcal {A}\). We assume for technical convenience that \(\hat{T}_{\{1,\ldots ,r^\dag \},\mathcal {C}^c}\) is a zero matrix with probability converging to one. This condition can be ensured, for example, by a simple thresholding step to reset the elements in \(\hat{T}_{\hat{\mathcal {I}},\hat{\mathcal {A}}^c}\) to zero. Our empirical experience reveals that the effects of such thresholding are negligibly small. Setting \((\mathcal {B},\tilde{\mathcal {C}})=\big (\{1,\ldots ,r^\dag \},\mathcal {C}\big )\) and writing \(\hat{\mathscr {R}}_{i,k}\) for \(\mathscr {R}_{i,k}\) with \(\pmb {h}\) replaced by \(\hat{\pmb {h}}\), we have
Noting that \(r^\dag \ge r_0\), \(|\mathcal {C}|\ge |\mathcal {A}|\) and \(n^{-2p^*/(2p^*+r_0)}=o\big (\min \{\lambda _1,\lambda _2\}\big )\), (S.13) implies that \(r^\dag =r_0\) and \(\mathcal {C}=\mathcal {A}\). Noting that \(\Vert \hat{\pmb {Q}}_{\cdot d}\Vert _2^{-2p^*}=O_p\big (\max _{d'\le r_0}\hat{h}_{d'}^{2p^*}\big )\) for \(d\in \mathcal {A}\), we have, by (S.8) and (S.13), that
so that \(O_p\big ( n^{-2p^*/(2p^*+r_0)}\big )\ge \Omega _p\big (\max _{d\le r_0}\{\hat{h}^{2p^*}_d\}+n^{-1}\hat{\pmb {h}}_{\{1,\dots ,r_0\}}^{-\varvec{1}_{\{1,\dots ,r_0\}}}\big )>0\) for general \(m_{T^*}\) functionally unrelated to the distribution function of \((\pmb {X},Y)\). This implies \(\hat{h}_d=\Omega _p(n^{-1/(2p^*+r_0)})\) for \(d\le r_0\) and \(d_{\{1,\dots ,r_0\}}({\hat{T}})=O_p(n^{-p^*/(2p^*+r_0)})\), which proves parts (i) and (ii). Part (iv) then follows by noting that \(n^{1/(2p^*+r_0)}=\Omega _p\big (\min _{d'\le r_0}\hat{h}_{d'}^{-1}\big )=O_p\big ( \Vert \hat{\pmb {Q}}_{\cdot d}\Vert _2\big )\) for \(d\in \mathcal {A}\). For \(d>r_0\), we have, by (ii) and (S.13), that
which proves part (iii). Part (v) follows by applying similar arguments to the penalty terms \(\lambda _2\big (1+\Vert \hat{\pmb {Q}}_{\cdot d}\Vert _2^{-2p^*}\big )^{-1}\), \(d\not \in \mathcal {A}\), in (S.13). \(\square \)
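The proof invokes a simple thresholding step that resets small elements of the estimated direction matrix to exactly zero. A minimal sketch of such a hard-threshold rule (the function `hard_threshold`, the cutoff `tau`, and the example matrix are illustrative assumptions, not the authors' choice of threshold):

```python
import numpy as np

def hard_threshold(T, tau):
    """Reset entries of T with magnitude below tau to exactly zero."""
    T = np.asarray(T, dtype=float).copy()
    T[np.abs(T) < tau] = 0.0
    return T

# Small spurious loadings on irrelevant coordinates are removed.
T_hat = np.array([[0.83, 0.01, -0.02],
                  [0.02, 0.95,  0.00]])
T_thr = hard_threshold(T_hat, 0.05)
```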
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cheung, K.Y., Lee, S.M.S. High-dimensional local polynomial regression with variable selection and dimension reduction. Stat Comput 34, 1 (2024). https://doi.org/10.1007/s11222-023-10308-1