Abstract
This study aims to simultaneously improve the prediction of event times with a machine learning model and identify informative variables in time-to-event data. To this end, we treat the time-to-event data as dichotomized counting process data and consider a time-dependent support vector machine (SVM) framework for these data, in which the decision function consists of a time-independent risk score and a time-dependent intercept. We also use the empirical partial derivative of the risk score function with respect to each individual predictor as an indicator of that predictor's importance. This approach makes it possible to predict survival time and, at the same time, find the variables that affect it. Simulation studies were conducted to confirm the performance of the model, and a real data analysis was conducted by predicting the survival time of lung cancer patients after diagnosis and selecting genes associated with lung cancer from human gene data.
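The gradient-based importance idea in the abstract can be illustrated with a small sketch. This is not the authors' estimator: it assumes a Gaussian kernel, ignores censoring and the time-dependent intercept, and fits the risk score with kernel ridge regression as a stand-in for the SVM fit. The names `gaussian_kernel`, `gradient_importance`, `sigma`, and `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise Gaussian kernel matrix between the rows of A and the rows of B.
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def gradient_importance(X, alpha, sigma):
    # Risk score (assumed form): f(x) = sum_j alpha_j * k(x, x_j).
    # Its partial derivative with respect to coordinate l at x_i is
    #   sum_j alpha_j * k(x_i, x_j) * (x_jl - x_il) / sigma^2.
    K = gaussian_kernel(X, X, sigma)                # shape (n, n)
    diff = X[None, :, :] - X[:, None, :]            # diff[i, j, l] = x_jl - x_il
    grads = np.einsum("ij,ijl->il", K * alpha[None, :], diff) / sigma ** 2
    # Empirical L2 norm of each partial derivative over the sample.
    return np.sqrt((grads ** 2).mean(axis=0))

# Toy data (illustrative only): the response depends on the first predictor alone.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)

# Kernel ridge fit as a stand-in for the SVM-based risk score (an assumption).
sigma, lam = 2.0, 1e-2
K = gaussian_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

importance = gradient_importance(X, alpha, sigma)
print(np.round(importance, 3))
```

Predictors whose empirical derivative norm is close to zero contribute little to the fitted score, so ranking or thresholding `importance` gives a simple variable screen under these assumptions.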
Data availability
We used the microarray data of Beer et al. (2002), which are available through the LungCancer3 function in the R package “GSCA”.
References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Beer, D. G., Kardia, S. L., Huang, C.-C., Giordano, T. J., Levin, A. M., Misek, D. E., Lin, L., Chen, G., Gharib, T. G., Thomas, D. G., et al. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine, 8, 816–824.
Carleo, A., Landi, C., Prasse, A., Bergantini, L., d’Alessandro, M., Cameli, P., Janciauskiene, S., Rottoli, P., Bini, L., & Bargagli, E. (2020). Proteomic characterization of idiopathic pulmonary fibrosis patients: Stable versus acute exacerbation. Monaldi Archives for Chest Disease, 90, 180–190.
Clarke, B. S., Fokoué, E., & Zhang, H. H. (2009). Principles and Theory for Data Mining and Machine Learning. Springer.
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34, 187–202.
Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101–148.
Fleming, T. R., & Harrington, D. P. (2011). Counting Processes and Survival Analysis. Wiley.
Fukumizu, K., & Leng, C. (2014). Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109, 359–370.
Gustafsson, P. M., Oxelius, V.-A., Nilsson, S., & Kjellman, B. (2008). Association between Gm allotypes and asthma severity from childhood to young middle age. Respiratory Medicine, 102, 266–272.
He, X., Wang, J., & Lv, S. (2021). Efficient kernel-based variable selection with sparsistency. Statistica Sinica, 31, 2123–2151.
Ibrahim, J. G., Chen, M.-H., & Sinha, D. (2001). Bayesian Survival Analysis. Springer.
Jeong, S., Kim, C., & Yang, H. (2023). Wasserstein filter for variable screening in binary classification in the reproducing kernel Hilbert space. Journal of Nonparametric Statistics, 1–20 (in press).
Kalbfleisch, J. D., & Prentice, R. L. (2011). The Statistical Analysis of Failure Time Data (2nd ed.). Wiley.
Khan, F. M., & Zubek, V. B. (2008). Support vector regression for censored data (SVRc): A novel tool for survival analysis. In IEEE International Conference on Data Mining (pp. 863–868). IEEE.
Lawless, J. F. (2002). Statistical Models and Methods for Lifetime Data. Wiley.
Ma, Y., Chen, Y., & Petersen, I. (2017). Expression and epigenetic regulation of cystatin B in lung cancer and colorectal cancer. Pathology-Research and Practice, 213, 1568–1574.
Park, B., & Park, C. (2021). Kernel variable selection for multicategory support vector machines. Journal of Multivariate Analysis, 186, 104800.
Peng, J., Li, W., Tan, N., Lai, X., Jiang, W., & Chen, G. (2022). USP47 stabilizes BACH1 to promote the Warburg effect and non-small cell lung cancer development via stimulating HK2 and GAPDH transcription. American Journal of Cancer Research, 12, 91–107.
Tibshirani, R., et al. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine, 16, 385–395.
Van Belle, V., Pelckmans, K., Van Huffel, S., & Suykens, J. A. (2011). Support vector methods for survival analysis: A comparison between ranking and regression approaches. Artificial Intelligence in Medicine, 53, 107–118.
Wang, Q. (2012). Kernel principal component analysis and its applications in face recognition and active shape models. arXiv:1207.3538
Wang, Y., Chen, T., & Zeng, D. (2016). Support vector hazards machine: A counting process framework for learning risk scores for censored outcomes. The Journal of Machine Learning Research, 17, 5825–5861.
Wei, L.-J. (1992). The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine, 11, 1871–1879.
Xia, Y. (2007). A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35, 2654–2690.
Xia, Y., Tong, H., Li, W., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 363–410.
Yang, H., Zhu, H., Ahn, M., & Ibrahim, J. G. (2021). Weighted functional linear Cox regression model. Statistical Methods in Medical Research, 30, 1917–1931.
Acknowledgements
We thank the Editor, Associate Editor and two reviewers, whose questions and insightful comments have led to a much improved paper.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF2021R1C1C1007023).
About this article
Cite this article
Jeong, S., Kang, K. & Yang, H. Gradient-based kernel variable selection for support vector hazards machine. J. Korean Stat. Soc. 53, 509–536 (2024). https://doi.org/10.1007/s42952-024-00256-5
DOI: https://doi.org/10.1007/s42952-024-00256-5