Abstract
Feature screening is commonly the first step in analyzing ultrahigh-dimensional data with censored survival times. In this article, we develop a surrogate-variable-based, model-free feature screening approach for censored data under the general censoring mechanism, in which the censoring variable may depend on both the survival variable and the covariates. The approach identifies observable variables whose active covariates include the active covariates of the survival variable. Any existing model-free feature screening method that has the sure screening property for full data can then be applied to estimate the sets of active covariates of the observable variables, and hence the set of active covariates of the survival variable. We establish the sure screening property of the proposed approach and demonstrate its finite-sample performance through simulations. We further illustrate the approach by analyzing two real datasets.
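As a rough illustration of the idea, the following sketch screens covariates against an observable surrogate instead of the unobserved survival time. It is not the authors' implementation: the surrogate is assumed to be the observed time Y = min(T, C), the absolute Spearman rank correlation is only a stand-in for a model-free screening utility, and the data-generating model, the function names, and the cutoff `d` are all hypothetical choices made for this example.

```python
import math
import random


def ranks(values):
    """Ranks 1..n of continuous data (ties are not handled)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = float(rank)
    return r


def abs_spearman(x, y):
    """Absolute Spearman rank correlation, used here as a stand-in utility."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) + 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # identical for ry without ties
    return abs(cov / var)


def simulate(n=200, p=50, seed=1):
    """Hypothetical data: T depends on X_0 and X_1; C depends on T."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    Y = []
    for row in X:
        T = math.exp(2 * row[0] - 2 * row[1] + 0.5 * rng.gauss(0, 1))
        C = T * math.exp(rng.gauss(0.5, 1))  # censoring time depends on T
        Y.append(min(T, C))                  # only Y is observed
    return X, Y


def screen(X, Y, d):
    """Keep the d covariates with the largest marginal utility for Y."""
    p = len(X[0])
    utilities = [abs_spearman([row[j] for row in X], Y) for j in range(p)]
    top = sorted(range(p), key=lambda j: utilities[j], reverse=True)[:d]
    return sorted(top)


X, Y = simulate()
selected = screen(X, Y, d=10)
print(selected)
```

In this toy setting the covariates that drive T (indices 0 and 1) are expected to survive the screening on Y, even though T itself is never used and the censoring time depends on T.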
Acknowledgements
Wang’s research was supported by the National Natural Science Foundation of China (General program 11871460 and program for Innovative Research Group in China 61621003) and by a grant from the Key Lab of Random Complex Structure and Data Science, CAS.
Appendix
Proof of Lemma 1
To facilitate the presentation, we write \(\mathbf{X} _{\mathcal {A}}=\{X_{k}:k\in \mathcal {A}\}\) for any non-negative integer set \(\mathcal {A}\). First, we prove Lemma 1 (i).
Under the GC mechanism, for any \(t\in [0,\tau )\), we have
$$pr(Y>t|\mathbf{X} )=pr(T>t, C>t|\mathbf{X} )=pr(T>t|\mathbf{X} )\,pr(C>t|T>t, \mathbf{X} ).$$
Recalling the definition of \(\mathcal {A}(Y|\mathbf{X} )\), it is easy to see that \(\mathcal {A}(Y|\mathbf{X} )\subseteq \mathcal {A}(T|\mathbf{X} ) \cup \mathcal {A}^{*}(C|\mathbf{X} )\). On the other hand, for any \(X_{j}\in \mathbf{X} _{\mathcal {A}(T|\mathbf{X} ) \cup \mathcal {A}^{*}(C|\mathbf{X} )}\), we have \(X_{j}\in \mathbf{X} _{\mathcal {A}(T|\mathbf{X} )}\) or \(X_{j}\in \mathbf{X} _{\mathcal {A}^{*}(C|\mathbf{X} )}\). That is, \(pr(T>t|\mathbf{X} )\) or \(pr(C>t|T>t, \mathbf{X} )\) depends functionally on \(X_{j}\) for some \(t\in [0,\tau )\), and hence \(pr(Y>t|\mathbf{X} )\) depends functionally on \(X_{j}\) by the conditions of Lemma 1. This proves \(X_{j}\in \mathbf{X} _{\mathcal {A}(Y|\mathbf{X} )}\), which gives the reverse inclusion \(\mathcal {A}(T|\mathbf{X} ) \cup \mathcal {A}^{*}(C|\mathbf{X} )\subseteq \mathcal {A}(Y|\mathbf{X} )\). Lemma 1 (i) is then proved.
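The argument rests on writing \(pr(Y>t|\mathbf{X} )\) as the product of \(pr(T>t|\mathbf{X} )\) and \(pr(C>t|T>t, \mathbf{X} )\). The following sketch checks this factorization numerically, assuming that \(Y=\min (T,C)\) is the observed time. The covariate value `x`, the exponential model for \(T\), and the way \(C\) is made to depend on \(T\) are hypothetical choices for illustration only.

```python
import math
import random


def survival_factorization(x=0.3, t=1.0, n=100_000, seed=0):
    """Compare pr(Y>t | X=x) with pr(T>t | X=x) * pr(C>t | T>t, X=x)."""
    rng = random.Random(seed)
    n_t = 0          # draws with T > t
    n_c_and_t = 0    # draws with T > t and C > t
    n_y = 0          # draws with Y > t
    for _ in range(n):
        T = rng.expovariate(math.exp(x))         # survival time given X = x
        C = T * rng.uniform(0.5, 2.0) + abs(x)   # censoring depends on T and x
        Y = min(T, C)                            # assumed observed time
        if T > t:
            n_t += 1
            if C > t:
                n_c_and_t += 1
        if Y > t:
            n_y += 1
    lhs = n_y / n                              # estimate of pr(Y > t | x)
    rhs = (n_t / n) * (n_c_and_t / n_t)        # product of the two factors
    return lhs, rhs


lhs, rhs = survival_factorization()
print(lhs, rhs)
```

The two estimates agree up to floating-point rounding, because the event \(Y>t\) coincides with the event that both \(T>t\) and \(C>t\). No independence between \(T\) and \(C\) is needed for this step.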
Lemma 1 (ii) can be proved similarly to Lemma 1 (i) by noting
for \(t\in [0,\tau )\). \(\square\)
Proofs of Theorems 1 and 2
The proofs follow directly from Lemma 1, and hence we omit them. \(\square\)
Proof of Lemma 2
Under the CRC mechanism, \(\mathcal {A}^{*}(C|\mathbf{X} )\) is an empty set and \(\mathcal {A}^{*}(\delta |\mathbf{X} )\) is a subset of \(\mathcal {A}(T|\mathbf{X} )\). This proves Lemma 2. \(\square\)
Proof of Lemma 3
Under the RC mechanism, Lemma 3 is a direct result of Lemma 1 by noting \(\mathcal {A}^{*}(C|\mathbf{X} )=\mathcal {A}(C|\mathbf{X} ).\) \(\square\)
Cite this article
Zhang, J., Wang, Q. & Wang, X. Surrogate-variable-based model-free feature screening for survival data under the general censoring mechanism. Ann Inst Stat Math 74, 379–397 (2022). https://doi.org/10.1007/s10463-021-00801-7