Properties of the Estimators of the Cox Regression Model with Imputed Data

Statistics in Biosciences

Abstract

Cox regression is one of the most commonly used methods in biomedical research for studying the relationship between a set of covariates and the time until an event of interest occurs. Missing data are common in research studies; they may compromise the well-known asymptotic properties of the estimators and lead to incorrect inferences. In this paper, we present the results of an extensive simulation study on how different methods for treating missing data affect the estimation of the parameters of a Cox model with mixed covariates. The study considers different missing-data mechanisms, proportions of missing data, and sample sizes. Five methods for treating missing data are applied, and the distributional properties of the estimators of the model parameters, their predictive capacity, and the precision of the imputations are compared. In general, publications that compare imputation techniques in the context of Cox models do so using complete case analysis or multiple imputation. In this paper, we propose also considering some flexible imputation methods. These methods are shown to give acceptable results, so their use is recommended in settings similar to those considered in this study. Finally, a real motivating case is introduced and analyzed following the recommendations derived from the simulation study.
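The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design: the function names (`cox_fit_1d`, `simulate`), the single binary covariate, the true coefficient of 0.7, the 30% missing-completely-at-random (MCAR) mechanism, and the use of mode imputation as a deliberately naive comparator are all assumptions made here for illustration. The sketch fits a one-covariate Cox model by Newton-Raphson on the partial likelihood, then refits it after complete-case deletion and after mode imputation.

```python
import math
import random


def cox_fit_1d(times, events, x, iters=25):
    """Estimate one Cox coefficient by Newton-Raphson on the partial likelihood.

    Assumes no tied event times (true almost surely for continuous times).
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    beta = 0.0
    for _ in range(iters):
        s0 = s1 = s2 = 0.0   # risk-set sums of w, w*x, w*x^2
        score = info = 0.0
        # Walk from the latest time to the earliest so the running sums
        # always cover the current risk set {j : t_j >= t_i}.
        for i in reversed(order):
            w = math.exp(beta * x[i])
            s0 += w
            s1 += w * x[i]
            s2 += w * x[i] * x[i]
            if events[i]:
                mean = s1 / s0
                score += x[i] - mean
                info += s2 / s0 - mean * mean
        if info <= 0.0:      # covariate is constant in every risk set
            break
        beta += score / info
    return beta


def simulate(n, beta, seed):
    """Exponential event times with hazard exp(beta * x), independent censoring."""
    rng = random.Random(seed)
    times, events, x = [], [], []
    for _ in range(n):
        xi = 1.0 if rng.random() < 0.5 else 0.0
        t = rng.expovariate(math.exp(beta * xi))  # event time
        c = rng.expovariate(0.3)                  # censoring time (assumed rate)
        times.append(min(t, c))
        events.append(1 if t <= c else 0)
        x.append(xi)
    return times, events, x, rng


TRUE_BETA = 0.7  # assumed value, for illustration only
times, events, x, rng = simulate(2000, TRUE_BETA, seed=1)

# MCAR: each covariate value is deleted with probability 0.3,
# independently of the covariate, the time, and the event indicator.
missing = [rng.random() < 0.3 for _ in x]

# Reference fit on the full data (unavailable in practice).
beta_full = cox_fit_1d(times, events, x)

# Complete case analysis: drop every subject with a missing covariate.
keep = [i for i, m in enumerate(missing) if not m]
beta_cc = cox_fit_1d([times[i] for i in keep],
                     [events[i] for i in keep],
                     [x[i] for i in keep])

# Naive single imputation: replace missing values by the observed mode.
observed = [x[i] for i in keep]
mode = 1.0 if 2 * sum(observed) >= len(observed) else 0.0
x_mode = [mode if m else xi for xi, m in zip(x, missing)]
beta_mode = cox_fit_1d(times, events, x_mode)

print("full data:", round(beta_full, 3))
print("complete case:", round(beta_cc, 3))
print("mode imputation:", round(beta_mode, 3))
```

Under MCAR, complete case analysis loses precision but should remain close to the full-data estimate, whereas filling in a single constant value misclassifies part of the sample and would be expected to pull the estimate toward zero. A study like the one summarized above would repeat such a comparison over many replicates, other missing-data mechanisms, and more capable imputation methods.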

Acknowledgements

The results presented in this work were obtained using the facilities of the CCT-Rosario Computational Centre, a member of the High Performance Computing National System (SNCAD, MincyT-Argentina). We appreciate the willingness of its staff to advise on and assist with the use of the system. We also thank Federico Daray and his research team for providing the data used in the real example presented in this study.

Author information

Corresponding author

Correspondence to Marta Beatriz Quaglino.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 186 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chiapella, L.C., Quaglino, M.B. & Mamprin, M.E. Properties of the Estimators of the Cox Regression Model with Imputed Data. Stat Biosci 15, 330–352 (2023). https://doi.org/10.1007/s12561-022-09361-7

  • DOI: https://doi.org/10.1007/s12561-022-09361-7
