Abstract
Cox regression is one of the most commonly used methods in biomedical research for studying the relationship between a set of covariates and the time until the occurrence of an event of interest. Missing data are common in research studies and may compromise the well-known asymptotic properties of the estimators and lead to incorrect inferences. In this paper, we present the results of an extensive simulation study on the impact of different methods for handling missing data when estimating the parameters of a Cox model with mixed covariates. The study considers different missing-data mechanisms, different proportions of missing data, and different sample sizes. Five methods for handling missing data are applied, and they are compared in terms of the distributional properties of the estimators of the model parameters, their predictive capacity, and the precision of the imputations. In general, publications that compare imputation techniques in the context of Cox models do so using complete case analysis or multiple imputation. In this paper, we propose that some flexible imputation methods also be considered. These methods were shown to provide acceptable results, so we recommend considering them in settings similar to those examined in this study. Finally, a real motivating case is introduced and analyzed following the recommendations derived from the simulation study.
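To make the kind of comparison described in the abstract concrete, the following is a minimal sketch and not the authors' simulation design. It assumes a single standard-normal covariate, exponential event times, independent exponential censoring, and values missing completely at random. It compares the full-data fit, complete case analysis, and a deliberately naive mean imputation, which is a stand-in chosen for brevity and is not claimed to be one of the five methods studied in the paper. All function names and parameter values (`simulate`, `fit_cox_1d`, `compare_methods`, the censoring rate 0.5) are hypothetical.

```python
import math
import random


def simulate(n, beta, rng):
    """Draw n subjects: covariate x, observed time, and event indicator."""
    times, events, xs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        t_event = rng.expovariate(math.exp(beta * x))  # hazard = exp(beta * x)
        t_cens = rng.expovariate(0.5)                  # assumed censoring rate
        times.append(min(t_event, t_cens))
        events.append(t_event <= t_cens)
        xs.append(x)
    return times, events, xs


def fit_cox_1d(times, events, xs, max_iter=50, tol=1e-8):
    """Newton-Raphson on the Cox partial likelihood, one covariate, no ties."""
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    beta = 0.0
    for _ in range(max_iter):
        s0 = s1 = s2 = 0.0   # running risk-set sums of w, w*x, w*x^2
        score = info = 0.0
        for i in order:      # longest time first, so the risk set only grows
            w = math.exp(beta * xs[i])
            s0 += w
            s1 += w * xs[i]
            s2 += w * xs[i] ** 2
            if events[i]:
                mean = s1 / s0
                score += xs[i] - mean          # contribution to the score
                info += s2 / s0 - mean ** 2    # contribution to the information
        if info <= 0.0:
            break
        step = max(-1.0, min(1.0, score / info))  # crude guard against overshoot
        beta += step
        if abs(step) < tol:
            break
    return beta


def compare_methods(n=300, beta=1.0, p_missing=0.3, n_rep=20, seed=1):
    """Average the estimate of beta over n_rep simulated data sets per method."""
    rng = random.Random(seed)
    sums = {"full": 0.0, "complete_case": 0.0, "mean_imputed": 0.0}
    for _ in range(n_rep):
        times, events, xs = simulate(n, beta, rng)
        missing = [rng.random() < p_missing for _ in range(n)]  # MCAR mask
        keep = [i for i in range(n) if not missing[i]]
        x_bar = sum(xs[i] for i in keep) / len(keep)
        sums["full"] += fit_cox_1d(times, events, xs)
        sums["complete_case"] += fit_cox_1d(
            [times[i] for i in keep],
            [events[i] for i in keep],
            [xs[i] for i in keep],
        )
        sums["mean_imputed"] += fit_cox_1d(
            times, events, [x_bar if m else x for x, m in zip(xs, missing)]
        )
    return {name: total / n_rep for name, total in sums.items()}
```

Calling `compare_methods()` returns the average estimate for each method, which can be set against the true value `beta = 1.0` to get a rough sense of bias. A fuller study along the lines of the abstract would also vary the missingness mechanism, the proportion of missing data, and the sample size, and would use mixed covariates and more flexible imputation methods.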
Acknowledgements
The results presented in this work were obtained using the facilities of the CCT-Rosario Computational Centre, a member of the High Performance Computing National System (SNCAD, MincyT-Argentina). We appreciate the willingness of its staff to advise and assist with the use of the system. We also thank Federico Daray and his research team for providing the data used in the real example presented in this study.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Chiapella, L.C., Quaglino, M.B. & Mamprin, M.E. Properties of the Estimators of the Cox Regression Model with Imputed Data. Stat Biosci 15, 330–352 (2023). https://doi.org/10.1007/s12561-022-09361-7
DOI: https://doi.org/10.1007/s12561-022-09361-7