Abstract
In this paper we design a sure independent ranking and screening procedure for censored regression (cSIRS, for short) with ultrahigh dimensional covariates. The inverse probability weighted cSIRS procedure is model-free in the sense that it does not specify a parametric or semiparametric regression function between the response variable and the covariates. Thus, it is robust to model mis-specification. This model-free property is very appealing in ultrahigh dimensional data analysis, particularly when there is lack of information for the underlying regression structure. The cSIRS procedure is also robust in the presence of outliers or extreme values as it merely uses the rank of the censored response variable. We establish both the sure screening and the ranking consistency properties for the cSIRS procedure when the number of covariates p satisfies \(p=o\{\exp (an)\}\), where a is a positive constant and n is the available sample size. The advantages of cSIRS over existing competitors are demonstrated through comprehensive simulations and an application to the diffuse large-B-cell lymphoma data set.
References
Fan, J., Feng, Y., Wu, Y.: High-dimensional variable selection for Cox's proportional hazards model. IMS Collect. 6, 70–86 (2010)
Fan, J., Li, R.: Variable selection for Cox's proportional hazards model and frailty model. Ann. Stat. 30, 74–99 (2002)
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Stat. Soc. B 70, 849–911 (2008)
Fan, J., Samworth, R., Wu, Y.: Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res. 10, 1829–1853 (2009)
Fan, J., Song, R.: Sure independence screening in generalized linear models with NP-Dimensionality. Ann. Stat. 38, 3567–3604 (2010)
Fang, K.T., Kotz, S., Ng, K.W.: Symmetric Multivariate and Related Distributions. Chapman & Hall, London (1989)
Gorst-Rasmussen, A., Scheike, T.: Independent screening for single-index hazard rate models with ultrahigh dimensional features. J. R. Stat. Soc. B 75, 217–245 (2013)
He, X., Wang, L., Hong, H.: Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Stat. 41, 342–369 (2013)
Li, G., Peng, H., Zhang, J., Zhu, L.: Robust rank correlation based screening. Ann. Stat. 40, 1846–1877 (2012a)
Li, R., Zhong, W., Zhu, L.: Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107, 1129–1139 (2012b)
Lo, S.H., Singh, K.: The product-limit estimator and the bootstrap: some asymptotic representations. Probab. Theory Relat. Fields 71, 455–465 (1986)
Lu, W., Li, L.: Boosting methods for nonlinear transformation models with censored survival data. Biostatistics 9, 658–667 (2008)
Rosenwald, A., Wright, G., Chan, W.C., Connors, J.M., Hermelink, H.K., Smeland, E.B., Staudt, L.M.: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med. 346, 1937–1947 (2002)
Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, New York (1980)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
Uno, H., Cai, T., Pencina, M.J., D’Agostino, R.B., Wei, L.J.: On the c-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat. Med. 30, 1105–1117 (2011)
Zhao, S.D., Li, Y.: Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J. Multivar. Anal. 105, 397–411 (2012)
Zhu, L.P., Li, L., Li, R., Zhu, L.X.: Model-free feature screening for ultrahigh dimensional data. J. Am. Stat. Assoc. 106, 1464–1475 (2011)
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)
Acknowledgments
Tingyou Zhou’s research is supported by Shanghai University of Finance and Economics Innovation Fund of Graduate Student (CXJJ-2014-447). Liping Zhu’s research is supported by National Natural Science Foundation of China (11371236 and 11422107), Henry Fok Education Foundation Fund of Young College Teachers (141002) and Innovative Research Team in University of China (IRT13077), Ministry of Education of China. All correspondence should be directed to Liping Zhu at zhu.liping@ruc.edu.cn. The authors thank the Editor, an Associate Editor and the anonymous reviewers for their constructive suggestions, which have helped greatly improve the presentation of our paper.
Appendix 1: Proof of theorems
1.1 Appendix 1.1: Proof of theorem 1
We first observe that \(\omega _{k} = \omega _{k,1}\). Thus it suffices to show that
With the conditional independence model (2.1) and the linearity condition, we have
Let \({\varvec{\Omega }}_{{\mathcal {A}}}(t) \!=\! E\left\{ \mathbf {x}_{\mathcal {A}}\mathbf {1}(Y\!<\!t) \right\} \) and \({\varvec{\Omega }}_{{\mathcal {A}}} = E\left\{ {\varvec{\Omega }}_{{\mathcal {A}}}(T) {\varvec{\Omega }}^\mathrm{\tiny {T}}_{{\mathcal {A}}}(T)\right\} \). Thus,
Without much difficulty, we can obtain that
which implies the desired result.
1.2 Appendix 1.2: Proof of theorem 2
We merely prove the case \({\widehat{G}}_k(t\mid X_k) = {\widehat{G}}(t)\), as the proofs for the other two cases are very similar. We first show that \(\widehat{\omega }_k= n^3/\{n(n-1)(n-2)\}{\widetilde{\omega }}_k\), a scaled version of \({\widetilde{\omega }}_k\), can be expressed as follows:
Lemma 1
Under Condition (C4), the Kaplan-Meier estimator \(\widehat{G}(\cdot )\) satisfies:
(1) \({\sup }_{0\le t \le T}|\widehat{G}(t)-G(t)|= O\{(\frac{\log n}{n})^{\frac{1}{2}}\}\) almost surely.

(2) \(\{\widehat{G}(t)\}^{-1}-\{G(t)\}^{-1}= n^{-1}\{G(t)\}^{-2} \sum _{g=1}^n{\xi (T_g,\delta _g,t)}+R_n(t)\), where \(\xi (T_g,\delta _g,t),\ g=1,\ldots ,n,\) are i.i.d. random variables with mean zero and \({\sup }_{0\le t \le T}|R_n(t)|=O\{(\frac{\log n}{n})^{\frac{3}{4}}\}\) almost surely.

(3) \({\sup }_{0\le t \le T}|\frac{1}{\widehat{G}(t)}-\frac{1}{G(t)}|=O\{(\frac{\log n}{n})^{\frac{1}{2}}\}\) almost surely.
Result (1) can be found in Lemma 3 of Lo and Singh (1986). Direct application of Taylor expansion yields (2) and (3).
Using Lemma 1, we can write
where \(h(\cdot )\) stands for the kernel of the U-statistic \(U_n\), which can be expressed as follows:
In other words, \(U_n\) is a standard U-statistic which can be expressed as
\[
U_n = \frac{1}{n!}\sum _{n!} {\mathcal {W}}(X_{i_1k},T_{i_1},\delta _{i_1};\ldots ;X_{i_nk},T_{i_n},\delta _{i_n}),
\]
where each \({\mathcal {W}}(X_{1k},T_1,\delta _1;\ldots ;X_{nk},T_n,\delta _n) = (k^{*})^{-1}\sum _{m=1}^{k^{*}} h(X_{(3m-2)k},T_{3m-2},\delta _{3m-2}; X_{(3m-1)k},T_{3m-1},\delta _{3m-1}; X_{(3m)k},T_{3m},\delta _{3m})\) is an average of \(k^{*}=[n/3]\) independent and identically distributed random variables, and \(\sum _{n!}\) denotes summation over all \(n!\) permutations \((i_1,\ldots ,i_n)\) of \((1,\ldots ,n)\).
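The construction of \({\mathcal {W}}\) — split the sample into \(k^{*}=[n/3]\) disjoint triples and average a degree-3 kernel over them — can be made concrete. The kernel \(h\) below is a toy symmetric stand-in (expanding it gives \(a^2+b^2+c^2-ab-bc-ca\), with mean 3 under i.i.d. standard normals), not the paper's kernel.

```python
import numpy as np

def block_average(h, samples):
    """W(x_1,...,x_n) = (1/k*) * sum_{m=1}^{k*} h(x_{3m-2}, x_{3m-1}, x_{3m}):
    an average over k* = floor(n/3) disjoint triples, hence an average of
    i.i.d. terms whenever the x_i themselves are i.i.d."""
    k_star = len(samples) // 3
    vals = [h(samples[3 * m], samples[3 * m + 1], samples[3 * m + 2])
            for m in range(k_star)]
    return float(np.mean(vals))

# illustrative symmetric kernel of degree 3; equals a^2+b^2+c^2-ab-bc-ca
h = lambda a, b, c: (a - b) * (a - c) + (b - a) * (b - c) + (c - a) * (c - b)

rng = np.random.default_rng(1)
x = rng.standard_normal(300)     # k* = 100 disjoint triples
w = block_average(h, x)          # concentrates near E[h] = 3
```

Averaging \({\mathcal {W}}\) over all permutations then recovers \(U_n\); the payoff of the representation is exactly the one used in the proof: moment bounds for the i.i.d. average transfer to the U-statistic.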
For any \(t\in (0,s_0 k^{*})\), where \(s_0\) is a positive constant, it follows that
The first equality holds because of the monotonicity of the exponential function, and the inequality follows from Markov's inequality.
Since \(E\{\exp (t\widehat{\omega }_k)\}=E\{\exp (t[U_n+O\{(\frac{\log n}{n})^{\frac{1}{2}}\}])\}=E\{\exp (t U_n)\} \exp [t \cdot O\{(\frac{\log n}{n})^{\frac{1}{2}}\}]\), and \(\exp [t \cdot O\{(\frac{\log n}{n})^{\frac{1}{2}}\}]\) converges to 1 as \(n\) goes to infinity for any fixed \(t\in (0,s_0 k^{*})\), an application of Jensen's inequality yields
where \({\psi _{h}(s)}=E[\exp \{s \cdot h(X_{jk},T_{j},\delta _{j};X_{ik},T_{i},\delta _{i}; X_{lk},T_{l},\delta _{l})\}],\,\,s\in (0,s_0)\).
The combination of the above results shows that
where \(s=t/k^{*}.\) Note that \(E\{h(X_{jk},T_{j},\delta _{j};X_{ik},T_{i},\delta _{i}; X_{lk},T_{l},\delta _{l})\}=\omega _k\), and a Taylor expansion shows that, for any generic random variable \(Y\), there exist a constant \(s_1\in (0,s)\) and a random variable \(Z\) satisfying \(0<Z<Y^2\exp (s_1 Y)\) such that \(\exp (s Y)=1+sY+s^2 Z/2.\) We thus obtain that
Applying Condition (C3) on \(X_k\), we obtain that there exists a constant \(C_0\) such that
Together with Taylor expansion that \(\exp (-s\varepsilon )=1-\varepsilon s+O(s^2)\), it follows that
where \(s=t/k^{*}\in (0,s_0)\) is sufficiently small (as long as t is sufficiently small). Thus, for an arbitrary \(\varepsilon >0\), there exists a small enough constant \(s_{\varepsilon }\) such that
Similarly, we can get that
Consequently,
Next, we prove that
Recall that we set \(\eta ={\min }_{k\in {\mathcal {A}}}\,\omega _{k}-{\max }_{k\in {\mathcal {I}}}\,\omega _{k}\). Therefore,
Applying (2.12) with \(\varepsilon =\eta /2\), we complete the proof of (2.13).
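The implication behind this choice of \(\varepsilon\) can be spelled out. On the event \(\max _{1\le k\le p}|\widehat{\omega }_k-\omega _k| \le \eta /2\), the definition of \(\eta\) gives the chain (a standard separation argument, written here in the paper's notation):

```latex
\min_{k\in\mathcal{A}} \widehat{\omega}_k
  \;\ge\; \min_{k\in\mathcal{A}} \omega_k - \eta/2
  \;=\; \max_{k'\in\mathcal{I}} \omega_{k'} + \eta/2
  \;\ge\; \max_{k'\in\mathcal{I}} \widehat{\omega}_{k'},
```

so every active covariate is ranked above every inactive one, which is precisely the ranking consistency asserted in (2.13).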
1.3 Appendix 1.3: Proof of theorem 3
We first prove that for any \(\varepsilon > 0\),
From the uniform bound condition of \(\mathbf {x}\), we can see that \(h(X_{jk},T_{j},\delta _{j};X_{ik},T_{i},\delta _{i};X_{lk},T_{l},\delta _{l}) \), the kernel of the U-statistic \(\widehat{\omega }_k\), is also bounded, that is,
Our foregoing arguments show that for any \(t\in (0,s_0 k^{*})\), we have
where \(k^{*}=[n/3]\). Together with the exponential inequality in Lemma 5.6.1.A of Serfling (1980), we obtain that
By choosing \(t=\frac{\tau _0^2 k^{*}}{a^2} \varepsilon \), the right hand side attains its minimum \(\exp (-\frac{\tau _0^2 k^{*}}{2a^2} \varepsilon ^2)\), which together with the symmetry of the U-statistic implies the validity of (4.2). Let \(\varepsilon \mathop {=}\limits ^{{\tiny \hbox {def}}}cn^{-\kappa }\). We have
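The stated choice of \(t\) can be checked by elementary calculus. Assuming the exponential bound has the sub-Gaussian form \(\exp \{-t\varepsilon + a^2 t^2/(2\tau _0^2 k^{*})\}\) (an assumption, but the only form consistent with the optimum quoted above):

```latex
\frac{d}{dt}\Bigl\{-t\varepsilon + \frac{a^2 t^2}{2\tau_0^2 k^{*}}\Bigr\}
  = -\varepsilon + \frac{a^2 t}{\tau_0^2 k^{*}} = 0
  \;\Longrightarrow\; t = \frac{\tau_0^2 k^{*}}{a^2}\,\varepsilon,
\qquad
-t\varepsilon + \frac{a^2 t^2}{2\tau_0^2 k^{*}}\,\Bigg|_{\,t=\tau_0^2 k^{*}\varepsilon/a^2}
  = -\frac{\tau_0^2 k^{*}}{2a^2}\,\varepsilon^2,
```

which recovers the minimum value \(\exp (-\frac{\tau _0^2 k^{*}}{2a^2}\varepsilon ^2)\) used in the proof.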
To facilitate our subsequent proof, define the event \({\mathcal {C}}_n \mathop {=}\limits ^{{\tiny \hbox {def}}}\{{\max }_{k \in {\mathcal {A}}}|\widehat{\omega }_k-\omega _k| \le cn^{-\kappa }\}\). Recall that we assume \({\min }_{k\in {\mathcal {A}}}\,\omega _k\ge 2cn^{-\kappa }\). Under this assumption, if the event \({\mathcal {C}}_n\) occurs, then \(\widehat{\omega }_k\ge cn^{-\kappa }\) for all \(k \in {\mathcal {A}}\). Thus, we obtain that
Since
The last equation holds because of (4.3). This completes the proof of Theorem 3.
Zhou, T., Zhu, L. Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27, 947–961 (2017). https://doi.org/10.1007/s11222-016-9664-z