
Model-free feature screening for ultrahigh-dimensional data conditional on some variables

Annals of the Institute of Statistical Mathematics

Abstract

In this paper, the conditional distance correlation (CDC) is used as a measure of dependence to develop a feature screening procedure for ultrahigh-dimensional data that conditions on a set of variables already known to be significant. The proposed procedure is model free and is called conditional distance correlation-sure independence screening (CDC-SIS for short). That is, we do not specify any model structure between the response and the predictors, which is appealing in some practical problems of ultrahigh-dimensional data analysis. The sure screening property of CDC-SIS is proved, and a simulation study is conducted to evaluate its finite-sample performance. A real data analysis illustrates the proposed method. The results indicate that CDC-SIS performs well.




Acknowledgements

Wang’s research was supported by the National Natural Science Foundation of China (General Program 11171331 and Key Program 11331011) and the National Natural Science Foundation for Creative Research Groups in China (61621003), a Grant from the Key Lab of Random Complex Structure and Data Science, CAS and Natural Science Fund of SZU.

Author information

Corresponding author

Correspondence to Qihua Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 230 KB)

Appendix


We first impose the following regularity conditions:

  1. (C1)

    Denote the density function of W by \(f(\cdot )\), and assume that it has continuous second derivatives. The support of W is assumed to be bounded and is denoted by \(\mathcal {W}=[a,b]\) with finite constants a and b.

  2. (C2)

    \(K(\cdot )\) is a symmetric density function with bounded support, and it is bounded on that support.

  3. (C3)

    The random variables \(\mathbf X \) and Y satisfy a sub-exponential tail condition uniformly in p. That is, there exists a positive constant \(s_0\) such that, for \(0\le s<s_0\),

    $$\begin{aligned} \sup _{W\in \mathcal {W}}\max _{1\le j \le p}E\{\exp (sX_j^2)\mid W\}&< \infty ,\\ \sup _{W\in \mathcal {W}}E\{\exp (s Y^2)\mid W\}&< \infty . \end{aligned}$$
  4. (C4)

    \(\min _{j\in \mathcal {{M}}^{*}}{\rho }_{j0}^{*} \ge 2cn^{-\kappa }\) for some constant \(c>0\) and \(0\le \kappa < 1/2\).
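Condition (C2) can be made concrete with a small numerical check. The sketch below uses the Epanechnikov kernel purely as an illustrative choice (the conditions above do not name a specific kernel) and verifies the three requirements: symmetry, unit mass on a bounded support, and boundedness. The helper names `epanechnikov` and `integrate` are hypothetical.

```python
def epanechnikov(u: float) -> float:
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere.

    Illustrative choice only; any symmetric, bounded density with
    bounded support would satisfy (C2).
    """
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def integrate(f, lo: float, hi: float, m: int = 20_000) -> float:
    """Composite trapezoidal rule with m sub-intervals on [lo, hi]."""
    step = (hi - lo) / m
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, m):
        total += f(lo + i * step)
    return total * step


# Evaluation grid on [-2, 2], wider than the support [-1, 1].
grid = [i / 1000.0 for i in range(-2000, 2001)]

# Symmetry: K(u) = K(-u) at every grid point.
is_symmetric = all(epanechnikov(u) == epanechnikov(-u) for u in grid)

# Density: the kernel integrates to one over its support.
total_mass = integrate(epanechnikov, -1.0, 1.0)

# Boundedness: the maximum over the grid is attained at u = 0.
upper_bound = max(epanechnikov(u) for u in grid)

# Bounded support: the kernel vanishes outside [-1, 1].
vanishes_outside = all(epanechnikov(u) == 0.0 for u in grid if abs(u) > 1.0)
```

Other compactly supported kernels, such as the uniform or biweight kernel, could be checked in the same way by swapping the kernel function.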

Proof of Theorem 1

The proof consists of three steps. Throughout, c and C denote generic positive constants whose values may vary from line to line.

  1. Step 1.

    For some \(0\le \kappa <1/2\), we first prove

    $$\begin{aligned} \max _{1\le j\le p}\sup _{w\in [a,b]}P\left( |\hat{\rho }^2 (X_j,Y|W=w)-{\rho }^2 (X_j,Y|W=w)|\ge cn^{-\kappa }\right) \le C \exp \left( -\frac{n^{-\kappa }}{Ch}\right) . \end{aligned} \tag{7}$$

    Refer to the Supplemental material for the proof of Step 1.

  2. Step 2.

    We prove \(P(\max _{1\le j\le p}|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa }) \le O(np\exp (-{n^{\gamma -\kappa }}/{\xi }))\). Note that, by the triangle inequality,

    $$\begin{aligned} P(|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa })&\le P(|\hat{\rho }_{j}^{*}-{\rho }_{j}^{*}|+|{\rho }_{j}^{*}-{\rho }_{j0}^{*}| \ge cn^{-\kappa })\\ &\le P(|\hat{\rho }_{j}^{*}-{\rho }_{j}^{*}| \ge cn^{-\kappa }/2)+ P(|{\rho }_{j}^{*}-{\rho }_{j0}^{*}| \ge cn^{-\kappa }/2). \end{aligned}$$

    By the definitions of \(\hat{\rho }_{j}^{*}\), \({\rho }_{j}^{*}=\frac{1}{n}\sum _{i=1}^n{\rho }_j^2 (W_i)\) with \({\rho }_j^2 (w)=\rho ^2(X_j,Y|W=w)\) and the result of Step 1, we have, for \(j=1,2,\ldots ,p\)

    $$\begin{aligned} P(|\hat{\rho }_{j}^{*}-{\rho }_{j}^{*}| \ge cn^{-\kappa }/2)&= P\left( \left| \frac{1}{n}\sum _{i=1}^n\hat{\rho }_j^2 (W_i)-\frac{1}{n}\sum _{i=1}^n{\rho }_j^2 (W_i)\right| \ge cn^{-\kappa }/2\right) \\ &\le \sum _{i=1}^n P(|\hat{\rho }_j^2 (W_i)-{\rho }_j^2 (W_i)| \ge cn^{-\kappa }/2)\\ &\le Cn\exp \left( -\frac{n^{-\kappa }}{Ch}\right) \\ &= O(n\exp (-n^{\gamma -\kappa }/\xi )), \end{aligned} \tag{8}$$

    where \(\xi \) is a positive constant, and \(0\le \kappa <\gamma \). Since each \({\rho }_j^2 (W_i)\) takes values in [0, 1], Hoeffding’s inequality gives, for \(j=1,2,\ldots ,p\),

    $$\begin{aligned} P(|{\rho }_{j}^{*}-{\rho }_{j0}^{*}| \ge cn^{-\kappa }/2)&= P\left( \left| \frac{1}{n}\sum _{i=1}^n{\rho }_j^2 (W_i)-E{\rho }_j^2 (W_i)\right| \ge cn^{-\kappa }/2\right) \\ &\le 2\exp (-nc^2n^{-2\kappa }/2)= O(\exp (-n^{1-2\kappa }/\xi )). \end{aligned} \tag{9}$$

    The bound in Eq. (8) dominates the bound in Eq. (9). Hence, for \(j=1,2,\ldots ,p\), we get

    $$\begin{aligned} P(|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa }) \le O(n\exp (-{n^{\gamma -\kappa }}/{\xi })). \end{aligned}$$

    Applying the union bound over \(j=1,2,\ldots ,p\), we thus have

    $$\begin{aligned} P\left( \max _{1\le j\le p}|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa }\right) \le O(np\exp (-{n^{\gamma -\kappa }}/{\xi })). \end{aligned}$$
  3. Step 3.

    We prove \(P(\mathcal {{M}}^{*} \subset \mathcal {\hat{M}}) \ge 1-O(ns_n\exp (-{n^{\gamma -\kappa }}/{\xi }))\). If \(\mathcal {{M}}^{*} \not \subset \mathcal {\hat{M}}\), then there exists some \(j\in \mathcal {{M}}^{*}\) such that \(\hat{\rho }_{j}^{*}<cn^{-\kappa }\). Since \(\min _{j\in \mathcal {{M}}^{*}}{\rho }_{j0}^{*} \ge 2cn^{-\kappa }\) by condition (C4), it follows that \(|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa }\) for that j, so that

    $$\begin{aligned} \{\mathcal {{M}}^{*} \not \subset \mathcal {\hat{M}}\} \subset \{|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa }\quad \text{ for } \text{ some } \; j\in \mathcal {{M}}^{*}\}. \end{aligned}$$

    Consequently,

    $$\begin{aligned} P\{\mathcal {{M}}^{*} \subset \mathcal {\hat{M}}\}&\ge P\{\max _{j\in \mathcal {{M}}^{*}}|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|< cn^{-\kappa }\}\\ &= 1-P\{\max _{j\in \mathcal {{M}}^{*}}|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa }\}\\ &\ge 1-s_n \max _{j\in \mathcal {{M}}^{*}}P\{|\hat{\rho }_{j}^{*}-{\rho }_{j0}^{*}|\ge cn^{-\kappa }\}\\ &\ge 1 - O(ns_n\exp (-{n^{\gamma -\kappa }}/{\xi })). \end{aligned}$$

\(\square \)
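The deterministic core of Step 3 can be illustrated with a toy example. The following is a minimal sketch, assuming the selected set is \(\mathcal {\hat{M}}=\{j:\hat{\rho }_{j}^{*}\ge cn^{-\kappa }\}\), as the argument in Step 3 implies. The function names and all numeric values are hypothetical and chosen only for illustration; the utilities are typed in by hand rather than estimated from data.

```python
def screen(utilities, c, kappa, n):
    """Return the indices whose utility reaches the threshold c * n**(-kappa)."""
    threshold = c * n ** (-kappa)
    return {j for j, u in enumerate(utilities) if u >= threshold}


def step3_premises_hold(rho_hat, rho_0, active, c, kappa, n):
    """Check the two premises used in Step 3 on the active set.

    signal_ok: the minimum population utility is at least 2 * threshold,
               which is the content of condition (C4).
    error_ok:  every estimation error on the active set is below threshold.
    """
    threshold = c * n ** (-kappa)
    signal_ok = min(rho_0[j] for j in active) >= 2 * threshold
    error_ok = max(abs(rho_hat[j] - rho_0[j]) for j in active) < threshold
    return signal_ok and error_ok


# Hypothetical toy values: threshold = 0.5 * 400**(-0.25), roughly 0.112.
n, c, kappa = 400, 0.5, 0.25
rho_0 = [0.30, 0.25, 0.0, 0.0, 0.0]        # population utilities
rho_hat = [0.22, 0.31, 0.05, 0.12, 0.01]   # made-up estimates
active = {0, 1}                            # plays the role of M*

selected = screen(rho_hat, c, kappa, n)
premises = step3_premises_hold(rho_hat, rho_0, active, c, kappa, n)

# When both premises hold, every active index survives the screening.
contains_active = active <= selected
```

In this toy example the inactive index 3 is also selected. This is consistent with the sure screening property, which guarantees that active predictors are retained but does not rule out false positives.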

About this article


Cite this article

Liu, Y., Wang, Q. Model-free feature screening for ultrahigh-dimensional data conditional on some variables. Ann Inst Stat Math 70, 283–301 (2018). https://doi.org/10.1007/s10463-016-0597-2


  • DOI: https://doi.org/10.1007/s10463-016-0597-2
