Abstract
With the explosion of digital information, high-dimensional data are frequently collected in many domains, where the dimension of the covariates can be much larger than the sample size. Many effective methods have recently been developed to reduce the dimension of such data; however, few perform well for survival data with censoring. In this article, we develop a novel nonparametric feature screening procedure for ultrahigh-dimensional survival data that incorporates an inverse probability weighting scheme to handle censoring. The proposed method is model-free and can therefore be applied to a wide range of survival models. Moreover, it is robust to heterogeneity and invariant to monotone increasing transformations of the response. The sure screening property and the ranking consistency property are established under mild conditions. The competence and robustness of our method are further confirmed through comprehensive simulation studies and an analysis of a real data example.
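To make the inverse probability weighting idea concrete, the following Python sketch estimates the censoring survival function \(K(t)=\Pr(C>t)\) with a Kaplan–Meier estimator and forms weights equal to the event indicator divided by \(\widehat{K}(Y_i)\). It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `km_censoring_survival` and `ipw_weights`, the argument name `event`, and the exact form of the weights are hypothetical, and the screening statistic itself is not reproduced here.

```python
import numpy as np


def km_censoring_survival(y, event):
    """Kaplan-Meier estimate of K(t) = P(C > t), evaluated at each y[i].

    y     : observed times min(T, C)
    event : 1 if the survival time was observed, 0 if censored
    Assumption: ties between events and censorings get no special treatment.
    """
    y = np.asarray(y, dtype=float)
    censored = 1 - np.asarray(event, dtype=int)  # censoring plays the role of the "event"
    times = np.unique(y)                         # sorted distinct observed times
    at_risk = np.array([(y >= t).sum() for t in times])
    n_censored = np.array([censored[y == t].sum() for t in times])
    surv = np.cumprod(1.0 - n_censored / at_risk)
    # Map every y[i] back to its position among the distinct times.
    idx = np.searchsorted(times, y, side="right") - 1
    return surv[idx]


def ipw_weights(y, event):
    """Weights event_i / K_hat(Y_i); censored observations get weight 0."""
    k_hat = km_censoring_survival(y, event)
    event = np.asarray(event, dtype=int)
    w = np.zeros_like(k_hat)
    usable = (event == 1) & (k_hat > 0)          # avoid dividing by zero
    w[usable] = 1.0 / k_hat[usable]
    return w


if __name__ == "__main__":
    print(ipw_weights([1, 2, 3, 4], [1, 0, 1, 1]))
```

In this toy call the second observation is censored, so the later uncensored observations are up-weighted to compensate for the information lost to censoring.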
References
Bair E, Tibshirani R (2004) Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol 2:511–522
Barut E, Fan J, Verhasselt A (2016) Conditional sure independence screening. J Am Stat Assoc 111:1266–1277
Bitouzé D, Laurent B, Massart P (1999) A Dvoretzky–Kiefer–Wolfowitz type inequality for the Kaplan–Meier estimator. Annales de l’Institut Henri Poincaré 35:735–763
Cox DR (1972) Regression models and life-tables (with discussion). J R Stat Soc Ser B 34:187–220
Dabrowska DM, Doksum KA (1988) Estimation and testing in a two-sample generalized odds-rate model. J Am Stat Assoc 83:744–749
Fan J, Li R (2002) Variable selection for Cox’s proportional hazards model and frailty model. Ann Stat 30:74–99
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Stat Soc Ser B 70:849–911
Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:2013–2038
Fan J, Feng Y, Wu Y (2010) High-dimensional variable selection for Cox’s proportional hazards. Borrow Strength Theory Power Appl A Festschr Lawrence D. Brown 6:70–86
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high dimensional additive models. J Am Stat Assoc 106:544–557
Fan J, Ma Y, Dai W (2014) Nonparametric independence screening in sparse ultra-high dimensional varying coefficient models. J Am Stat Assoc 109:1270–1284
Gorst-Rasmussen A, Scheike T (2013) Independent screening for single-index hazard rate models with ultra-high-dimensional features. J R Stat Soc Ser B 75:217–245
He X, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41:342–369
Hong HG, Kang J, Li Y (2018) Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Anal 24:45–71
Huang J, Horowitz JL, Ma S (2008) Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Stat 36:587–613
Jin J, Zhang CH, Zhang Q (2014) Optimality of graphlet screening in high dimensional variable selection. J Mach Learn Res 15:2723–2772
Kendall MG (1962) Rank correlation methods, 3rd edn. Griffin & Co, London
Li R, Zhong W, Zhu LP (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
Lin HZ, Peng H (2013) Smoothed rank correlation of the linear transformation regression model. Comput Stat Data Anal 57:615–630
Li G, Peng H, Zhang J, Zhu LX (2012) Robust rank correlation based screening. Ann Stat 40:1846–1877
Lu W, Zhang HH (2007) Variable selection for proportional odds model. Stat Med 26:3771–3781
Ma S, Li R, Tsai CL (2017) Variable screening via quantile partial correlation. J Am Stat Assoc 112:650–663
Peng L, Fine J (2009) Competing risks quantile regression. J Am Stat Assoc 104:1440–1453
Rosenwald A, Wright G, Chan WC, Connors JM, Hermelink HK, Smeland EB, Staudt LM (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 346:1937–1947
Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley, New York
Shi P, Qu A (2017) Weak signal identification and inference in penalized model selection. Ann Stat 45:1214–1253
Song R, Lu W, Ma S, Jeng XJ (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101:799–814
Tibshirani RJ (1997) The lasso method for variable selection in the Cox model. Stat Med 16:385–395
Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei LJ (2011) On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med 30:1105–1117
Wu Y, Yin G (2015) Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 102:65–76
Zeng D, Lin DY (2007) Maximum likelihood estimation in semiparametric regression models with censored data. J R Stat Soc Ser B 69:507–564
Zhang J, Liu Y, Wu Y (2017) Correlation rank screening for ultrahigh-dimensional survival data. Comput Stat Data Anal 2017:121–132
Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105:397–411
Zhou T, Zhu LP (2017) Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27:947–961
Zhu LP, Li L, Li R, Zhu LX (2011) Model-free feature screening for ultrahigh dimensional data. J Am Stat Assoc 106:1464–1475
Acknowledgements
The authors thank the Editor, an Associate Editor and the anonymous reviewers for their constructive suggestions, which have greatly helped to improve our paper. Pan’s work was supported by the Graduate Innovation Foundation of Shanghai University of Finance and Economics of China (CXJJ-2015-448). Zhou’s work was supported by the State Key Program of National Natural Science Foundation of China (71331006) and the State Key Program in the Major Research Plan of National Natural Science Foundation of China (91546202).
Appendices
Appendix A
Lemma 1
(Bitouzé et al. 1999, Theorem 1) Let \(\{X_i\}_{i=1}^n\) and \(\{Y_i\}_{i=1}^n\) be independent sequences of independent identically distributed nonnegative random variables with distribution functions \(F(\cdot )\) and \(G(\cdot )\). Let \(\widehat{F}_{n}\) be the Kaplan–Meier estimator of \(F(\cdot )\). There exists a constant \(M>0\), for any \(\lambda >0\), such that
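For orientation, the inequality of Bitouzé et al. (1999, Theorem 1) is usually stated in the form below, with \(\widehat{F}_n\), \(F\), \(G\), \(M\) and \(\lambda \) as in Lemma 1. The leading constant 2.5 is recalled from that reference rather than taken from the text here, so it should be checked against the original paper.

\[
P\left( \sqrt{n}\,\Vert (1-G)(\widehat{F}_{n}-F)\Vert _{\infty } > \lambda \right) \le 2.5\exp \left\{ -2\lambda ^2 + M\lambda \right\} .
\]

Choosing \(\lambda \) proportional to \(\sqrt{n}\) turns the right-hand side into an exponential bound in \(n\) once \(n\) is large enough that the quadratic term dominates the linear one.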
Lemma 2
(Serfling 1980, p. 201, Theorem B) Let \(h = h(X_1,X_2,\ldots ,X_m)\) be the kernel of a U-statistic \(U_n\) for estimating \(\theta = \theta (F)\), with \(E\exp \left\{ sh(X_1,X_2,\ldots ,X_m)\right\} <\infty \) for \(0<s<s_0\). For any \(\varepsilon >0\), when \(n>m\), there exist \(c_1>0\) and \(0<\rho <1\) such that
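As a reading aid, the exponential bound in Serfling (1980, Theorem B) is commonly written as follows, where \(U_n\) denotes the U-statistic built from the kernel \(h\) on a sample of size \(n\). This form is recalled from that reference and is an assumption about the display intended here; the constants \(c_1\) and \(\rho \) depend on \(\varepsilon \).

\[
P\left( U_n - \theta \ge \varepsilon \right) \le c_1\rho ^{n}.
\]

Applying the same bound to the kernel \(-h\) gives the matching lower-tail statement, which is how two-sided deviations are typically controlled.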
Lemma 3
Under Condition (C1), for any \(c_2 >0\), when \(n \ge M^2c_3^{-1}\), where \(\delta >0\) and \(c_3 = \frac{1}{9}(\frac{c_2}{1+c_2})^2\delta ^8\),
Proof of Lemma 3
For any x, \(y>0\), taking \(a_1=\frac{c_2}{1+c_2}\), i.e., \(a_1\in (0,1)\), it is easy to show that
Let \(S(t)= 1- F(t)= \Pr (T> t)\). Condition (C1) implies that there exists a constant \(\delta >0\) such that \(\delta \le S(Y_i) \le 1\), \(\delta \le K(Y_i) \le 1\) and \(0\le \widehat{K}(Y_i)\le 1\) for \(i = 1, 2, \ldots , n\). Therefore, by Lemma 1, it follows that
where \(\Vert \cdot \Vert _{\infty }\) is the \(L_{\infty }\) norm. When \(n(\frac{a_1\delta ^4}{3})^2 \ge M\sqrt{n}\frac{a_1\delta ^4}{3}\), i.e. \(n\ge M^2(\frac{a_1\delta ^4}{3})^{-2}\), taking \(c_3 = (\frac{a_1\delta ^4}{3})^{2}=\frac{1}{9}(\frac{c_2}{c_2+1})^2\delta ^8\), we have
\(\square \)
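The sample-size condition in Lemma 3 can be unpacked with a short calculation. Writing \(\epsilon _1 = a_1\delta ^4/3\) (the symbol \(\epsilon _1\) is introduced here only as shorthand), the steps are:

\[
\begin{aligned}
n\epsilon _1^2 \ge M\sqrt{n}\,\epsilon _1
&\iff \sqrt{n} \ge M\epsilon _1^{-1}
\iff n \ge M^2\epsilon _1^{-2} = M^2c_3^{-1}, \\
c_3 &= \epsilon _1^2 = \frac{1}{9}\left( \frac{c_2}{1+c_2}\right) ^2\delta ^8 .
\end{aligned}
\]

For example, with \(c_2 = 1\) we get \(a_1 = 1/2\), so \(c_3 = \frac{1}{9}\cdot \frac{1}{4}\,\delta ^8 = \delta ^8/36\) and the requirement becomes \(n \ge 36M^2\delta ^{-8}\).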
Lemma 4
Under Condition (C1), for any \(c_4 >0\), when \(n \ge M^2c_5^{-1}\), where \(c_5 = \frac{1}{9}(\frac{c_4}{1+c_4})^2\delta ^8\),
Proof of Lemma 4
As in the proof of Lemma 3, taking \(a_2=c_4/(1+c_4)\), i.e., \(a_2\in (0,1)\), it follows that
When \(n(\frac{a_2\delta ^4}{3})^2 \ge M\sqrt{n}\frac{a_2\delta ^4}{3}\), i.e. \(n\ge M^2(\frac{a_2\delta ^4}{3})^{-2}\), taking \(c_5 = (\frac{a_2\delta ^4}{3})^{2}=\frac{1}{9}(\frac{c_4}{c_4+1})^2\delta ^8\), we have
\(\square \)
Proof of Theorem 1
We now prove the first statement. To start with, rewrite
Thus,
For \(I_{k1}\), denote
where
We can prove that
Then,
For \(J_{k1}\),
Let
with Conditions (C1) and (C2), it follows that \(E\exp \{sf_2(W_i,W_j,W_l)\} < \infty \). Thus by Lemma 2, when \(n > 3\), we have
By Lemma 3 with \(c_2 = 1\), so that \(c_3 = \delta ^8/{36}\), when \(n\ge 36M^2\delta ^{-8}\), we have
Therefore, by Eqs. (10) and (11),
Since \(Ef_2(W_i,W_j,W_l)\le \frac{1}{\delta ^3}\sup _{k}E^2|X_{k}|\), by Lemma 3, when \(\frac{\delta ^3cn^{-\kappa }}{4\sup _kE^2|X_k|} \le 1\) and \(n\ge 36M^2\delta ^{-8}\),
For \(J_{k2}\), denote
it is verified that \(E\exp \{sf_3(W_i,W_j,W_l)\} < \infty \), then by Lemma 2,
Using the triangle inequality and Eqs. (12)–(14), when \(n\ge m_0\), where \(m_0 = \max \{3, 36M^2\delta ^{-8}, [\frac{c\delta ^3}{4\sup _kE^2|X_k|}]^{\frac{1}{\kappa }}\}\), it follows that
For \(I_{k2}\), under Condition (C1), it follows that
Then taking \(c_4 = 1\) and \(c_5 = \delta ^8/36\), when \(n \ge 36M^2\delta ^{-8}\), with Condition (C2) and Lemma 4, we have
Therefore, by Eqs. (9), (15) and (16), when \(n > m_0\), it follows that
where \(c_6 = c\delta ^3/8\).
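The threshold \(m_0\) used in the proof combines three separate sample-size requirements: \(n > 3\) for the U-statistic bound, \(n \ge 36M^2\delta ^{-8}\) from Lemmas 3 and 4, and \(n \ge [c\delta ^3/(4\sup _kE^2|X_k|)]^{1/\kappa }\), which is the condition \(\frac{\delta ^3cn^{-\kappa }}{4\sup _kE^2|X_k|}\le 1\) solved for \(n\). A small Python helper can evaluate it for given constants. The function name `sample_size_threshold` and the argument name `sup_sq_mean` (standing for \(\sup _kE^2|X_k|\)) are hypothetical, and the numerical inputs are purely illustrative since the constants are not known in practice.

```python
def sample_size_threshold(M, delta, c, kappa, sup_sq_mean):
    """Evaluate m_0 = max{3, 36 M^2 delta^-8, [c delta^3 / (4 sup_k E^2|X_k|)]^(1/kappa)}.

    sup_sq_mean is a stand-in for sup_k (E|X_k|)^2; all inputs are assumed positive.
    """
    km_term = 36.0 * M ** 2 * delta ** (-8)  # requirement from Lemmas 3 and 4
    ratio_term = (c * delta ** 3 / (4.0 * sup_sq_mean)) ** (1.0 / kappa)
    return max(3.0, km_term, ratio_term)     # 3 comes from the order-3 kernel


if __name__ == "__main__":
    print(sample_size_threshold(M=1.0, delta=1.0, c=4.0, kappa=0.5, sup_sq_mean=1.0))
```

Because \(\delta \le 1\), the term \(36M^2\delta ^{-8}\) grows quickly as the lower bound \(\delta \) on the survival functions shrinks, which reflects how heavy censoring raises the sample size needed for the bounds to apply.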
For the second statement, take \(\gamma _n = c_0n^{-\kappa }\), where \(c_0\le c\), so that \(\gamma _n\le cn^{-\kappa }\). Therefore,
Let \(\mathscr {A}_n = \left\{ \max _{k\in \mathscr {A}}|\widehat{\omega }_k-\omega _k|\le cn^{-\kappa }\right\} \). If \(\mathscr {A}\nsubseteq \widehat{\mathscr {A}}\), there exists some \(k\in \mathscr {A}\) such that \(\widehat{\omega }_k<cn^{-\kappa }\); by Condition (C3),
Therefore,
where s is the cardinality of \(\mathscr {A}\). This completes the proof of Theorem 1. \(\square \)
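The thresholding step behind the second statement can be sketched in Python. The sketch assumes the selected set has the form \(\widehat{\mathscr {A}} = \{k: \widehat{\omega }_k \ge \gamma _n\}\) with \(\gamma _n = c_0n^{-\kappa }\), which is consistent with the argument above but is not spelled out in this appendix; the function name `screen_features` and the example utilities are hypothetical.

```python
import numpy as np


def screen_features(omega_hat, c0, kappa, n):
    """Return indices k with omega_hat[k] >= gamma_n, where gamma_n = c0 * n^(-kappa).

    omega_hat : estimated screening utilities, one per covariate
    Assumption: the selected set is defined by this hard threshold.
    """
    gamma_n = c0 * n ** (-kappa)
    return np.flatnonzero(np.asarray(omega_hat, dtype=float) >= gamma_n)


if __name__ == "__main__":
    # Threshold is 1 * 100^(-0.5) = 0.1, so only the first and third covariates survive.
    print(screen_features([0.5, 0.01, 0.2], c0=1.0, kappa=0.5, n=100))
```

In practice the constants \(c_0\) and \(\kappa \) are unknown, so a common alternative is to keep a fixed number of top-ranked covariates instead of applying an explicit threshold.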
Proof of Theorem 2
Recall the definition of \(w_k(t)\). By Condition (C4), for \(k\in \mathscr {I}\) and \(t \in \varPsi _T\), we can prove that
and thus \(\omega _k = 0\). It follows from Condition (C3) that \(\min _{k\in \mathscr {A}} \omega _k - \max _{k\in \mathscr {I}} \omega _k > 2cn^{-\kappa }\). Thus,
This completes the proof of Theorem 2. \(\square \)
Appendix B
See Table 12.
Pan, J., Yu, Y. & Zhou, Y. Nonparametric independence feature screening for ultrahigh-dimensional survival data. Metrika 81, 821–847 (2018). https://doi.org/10.1007/s00184-018-0660-5
DOI: https://doi.org/10.1007/s00184-018-0660-5