Abstract
A prediction model is calibrated if, roughly, for any percentage x, we can expect x out of 100 subjects to experience the event among all subjects with a predicted risk of x%. The calibration assumption is typically assessed graphically, but in practice it is often hard to judge whether a “disappointing” calibration plot reflects a departure from the calibration assumption or merely “bad luck” due to sampling variability. To address this issue, we propose a graphical approach that visualizes how well a calibration plot agrees with the calibration assumption. The approach is based mainly on generating new plots that mimic the available data under the calibration assumption. The method handles the common, non-trivial situations in which the data contain censored observations and occurrences of competing events, by building on ideas from constrained non-parametric maximum likelihood estimation. Two examples from large cohort data illustrate our proposal. The ‘wally’ R package is provided to make the methodology easy to use.
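The idea of generating plots that mimic the data under the calibration assumption can be illustrated in the simplest possible setting, assuming fully observed binary outcomes (no censoring, no competing events). The following is a minimal sketch of that simplified case only; it is not the constrained-NPMLE procedure of the paper, nor the interface of the ‘wally’ package. The function names and the grouping of subjects into equally sized risk groups are assumptions made for illustration.

```python
import random

def simulate_calibrated_outcomes(predicted_risks, rng):
    """Draw one binary outcome per subject with event probability equal to the predicted risk,
    so that the calibration assumption holds by construction."""
    return [1 if rng.random() < p else 0 for p in predicted_risks]

def calibration_table(predicted_risks, outcomes, n_groups=5):
    """Mean predicted risk and observed event frequency within groups of increasing risk."""
    order = sorted(range(len(predicted_risks)), key=lambda i: predicted_risks[i])
    size = len(order) // n_groups
    table = []
    for g in range(n_groups):
        # The last group absorbs the remainder when n is not divisible by n_groups.
        idx = order[g * size:] if g == n_groups - 1 else order[g * size:(g + 1) * size]
        mean_pred = sum(predicted_risks[i] for i in idx) / len(idx)
        obs_freq = sum(outcomes[i] for i in idx) / len(idx)
        table.append((mean_pred, obs_freq))
    return table

rng = random.Random(2017)
# Hypothetical predicted risks for 500 subjects.
risks = [rng.uniform(0.05, 0.6) for _ in range(500)]
# Several data sets simulated under calibration: their spread shows what "bad luck" can look like.
simulated_tables = [
    calibration_table(risks, simulate_calibrated_outcomes(risks, rng)) for _ in range(8)
]
```

Plotting the observed frequencies against the mean predicted risks for each simulated table, next to the same plot for the actual data, gives a visual reference for the amount of disagreement that sampling variability alone can produce.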
References
Aalen OO, Johansen S (1978) An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scand J Stat 5:141–150
Andersen PK, Borgan Ø, Gill RD, Keiding N (1993) Statistical models based on counting processes. Springer, New York
Austin PC, Steyerberg EW (2014) Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med 33(3):517–535
Barber S, Jennison C (1999) Symmetric tests and confidence intervals for survival probabilities and quantiles of censored survival data. Biometrics 55(2):430–436
Beyersmann J, Allignol A, Schumacher M (2011) Competing risks and multistate models with R. Springer Science & Business Media, Berlin
Blanche P (2017) Confidence intervals for the cumulative incidence function via constrained NPMLE. https://ifsv.sund.ku.dk/biostat/biostat_annualreport/index.php5/Research_reports
Blanche P, Proust-Lima C, Loubère L, Berr C, Dartigues J-F, Jacqmin-Gadda H (2015) Quantifying and comparing dynamic predictive accuracy of joint models for longitudinal marker and time-to-event in presence of censoring and competing risks. Biometrics 71(1):102–113
Bröcker J, Smith LA (2007) Increasing the reliability of reliability diagrams. Weather Forecast 22(3):651–661
Buja A, Cook D, Hofmann H, Lawrence M, Lee E-K, Swayne DF, Wickham H (2009) Statistical inference for exploratory data analysis and model diagnostics. Philos Trans R Soc Lond A Math Phys Eng Sci 367(1906):4361–4383
Camm A et al (2010) Guidelines for the management of atrial fibrillation: the task force for the management of atrial fibrillation of the European Society of Cardiology (ESC). Eur Heart J 31:2369–2429
Crowson CS, Atkinson EJ, Therneau TM (2016) Assessing calibration of prognostic risk scores. Stat Methods Med Res 25:1692–1706
Demler OV, Paynter NP, Cook NR (2015) Tests of calibration and goodness-of-fit in the survival setting. Stat Med 34(10):1659–1680
Efron B (1981) Censored data and the bootstrap. J Am Stat Assoc 76(374):312–319
Ekstrøm CT (2013) Teaching ‘instant experience’ with graphical model validation techniques. Teach Stat 36(1):23–26
Fournier M-C, Foucher Y, Blanche P, Buron F, Giral M, Dantan E (2016) A joint model for longitudinal and time-to-event data to better assess the specific role of donor and recipient factors on long-term kidney transplantation outcomes. Eur J Epidemiol 31(5):469–479
Freedman AN, Seminara D, Gail MH, Hartge P, Colditz GA, Ballard-Barbash R, Pfeiffer RM (2005) Cancer risk prediction models: a workshop on development, evaluation, and application. J Natl Cancer Inst 97(10):715–723
Gail MH, Pfeiffer RM (2005) On criteria for evaluating models of absolute risk. Biostatistics 6(2):227–239
Gerds TA, Cai T, Schumacher M (2008) The performance of risk prediction models. Biometr J 50(4):457–479
Gerds TA, Andersen PK, Kattan MW (2014) Calibration plots for risk prediction models in the presence of competing risks. Stat Med 33(18):3191–3203
Geskus RB (2015) Data analysis with competing risks and intermediate states, vol 82. CRC Press, Boca Raton
Handford M (2007) Where is Wally? Walker Books Ltd, London
Kaplan E, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53(282):457–481
Lemeshow S, Hosmer DW (1982) A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol 115(1):92–106
Li G, Sun Y (2000) A simulation-based goodness-of-fit test for survival data. Stat Probab Lett 47(4):403–410
Lin DY, Wei L-J, Ying Z (1993) Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80(3):557–572
Loy A, Follett L, Hofmann H (2016) Variations of Q–Q plots: the power of our eyes! Am Stat 70(2):202–214
Majumder M, Hofmann H, Cook D (2013) Validation of visual statistical inference, applied to linear models. J Am Stat Assoc 108(503):942–956
Martinussen T, Scheike T (2006) Dynamic regression models for survival data. Springer, Berlin
Pepe M, Janes H (2013) Methods for evaluating prediction performance of biomarkers and tests. In: Lee M-L, Gail G, Cai T, Pfeiffer R, Gandy A (eds) Risk assessment and evaluation of predictions. Springer, Berlin
Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM, Zheng Y (2008) Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol 167(3):362–368
R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Robins J, Ritov Y et al (1997) Toward a curse of dimensionality appropriate asymptotic theory for semi-parametric models. Stat Med 16(3):285–319
Steyerberg E (2009) Clinical prediction models: a practical approach to development, validation, and updating. Springer, Berlin
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW (2010) Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology 21(1):128
Thomas DR, Grunkemeier GL (1975) Confidence interval estimation of survival probabilities for censored data. J Am Stat Assoc 70(352):865–871
Tukey J (1972) Some graphic and semigraphic displays. In: Bancroft T (ed) Statistical papers in honor of George W. Snedecor. Iowa State University, Ames, Iowa, p 293–316
Viallon V, Benichou J, Clavel-Chapelon F, Ragusa S (2009) How to evaluate the calibration of a disease risk prediction tool. Stat Med 28:901–916
Vickers A, Cronin A (2010) Everything you always wanted to know about evaluating prediction models (but were too afraid to ask). Urology 76(6):1298–1301
Acknowledgements
PB is grateful to the Bettencourt Schueller foundation for its support. We thank the DIVAT consortium and the Three-City study group for providing the data of the DIVAT and Three-City cohorts. Their sources of support are listed at www.divat.fr and www.three-city-study.com.
Appendices
A Constrained Kaplan–Meier estimates
Let us assume that we observe the data \(\big \{\big (\widetilde{T}_i,\varDelta _i\big ),i=1,\ldots ,n\big \}\), where \(\widetilde{T}_i=\min (T_i,C_i)\) and \(\varDelta _i={\mathbb {1}}\{ T_i \le C_i \}\). The constrained Kaplan–Meier estimate \(\widehat{S}(u)\) of \(S(u)=\mathbb {P}(T>u)\), with constraint \(\widehat{S}(t)=p\), for a given value \(p \in ]0,1[\), is defined as
where \(d_i=\sum _{j=1}^n {\mathbb {1}}\{ \widetilde{T}_j = \widetilde{T}_i \}\varDelta _j\) is the number of observed events at time \(\widetilde{T}_i\), \(n_i=\sum _{j=1}^n {\mathbb {1}}\{ \widetilde{T}_j \ge \widetilde{T}_i \}\) is the number of subjects at risk at time \(\widetilde{T}_i\), and \(\lambda \in \mathbb {R}\) is such that \(\widehat{S}(t)=p\) (Thomas and Grunkemeier 1975). The usual Kaplan–Meier estimator corresponds to the above formulas in the special case where \(\lambda =0\).
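As an illustration of how \(\lambda\) can be found numerically, the sketch below assumes that, for \(u \le t\), the constrained estimate takes the familiar Thomas and Grunkemeier form \(\widehat{S}(u)=\prod _{i:\widetilde{T}_i \le u}\{1-d_i/(n_i+\lambda )\}\), with the product taken over distinct observed times. This form is an assumption stated here for the sketch, and times after t are not covered. Under that assumption \(\widehat{S}(t)\) increases from 0 to 1 as \(\lambda\) grows, so a simple bisection suffices. All function names and the toy data are hypothetical.

```python
def event_table(times, deltas):
    """Distinct observed times with event counts d_i and numbers at risk n_i."""
    rows = []
    for s in sorted(set(times)):
        d = sum(1 for tj, dj in zip(times, deltas) if tj == s and dj == 1)
        n = sum(1 for tj in times if tj >= s)
        rows.append((s, d, n))
    return rows

def survival_at(rows, u, lam=0.0):
    """Product over event times <= u of (1 - d_i / (n_i + lam)); lam = 0 gives Kaplan-Meier."""
    surv = 1.0
    for s, d, n in rows:
        if s <= u and d > 0:
            surv *= 1.0 - d / (n + lam)
    return surv

def solve_lambda(rows, t, p, n_iter=200):
    """Bisection for lam such that the (assumed) constrained estimate equals p at time t."""
    used = [(d, n) for s, d, n in rows if s <= t and d > 0]
    if not used:
        raise ValueError("no observed event up to t: the constraint cannot be met")
    lo = max(d - n for d, n in used)      # as lam decreases to lo, one factor tends to 0
    hi = lo + 1.0
    while survival_at(rows, t, hi) < p:   # survival tends to 1 as lam grows
        hi = lo + 2.0 * (hi - lo)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if survival_at(rows, t, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy data (hypothetical): observed times and event indicators (1 = event, 0 = censored).
times = [1, 2, 3, 4, 5, 6, 7, 8]
deltas = [1, 0, 1, 1, 0, 1, 0, 1]
rows = event_table(times, deltas)
lam = solve_lambda(rows, t=5, p=0.7)
```

With these toy data the unconstrained Kaplan–Meier estimate at t = 5 lies below 0.7, so the solution has \(\lambda > 0\); a target p below the Kaplan–Meier value would give \(\lambda < 0\).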
B Constrained Aalen–Johansen estimates
Let us assume that we observe the data \(\big \{\big (\widetilde{T}_i,\widetilde{\eta }_i\big ),i=1,\ldots ,n\big \}\) where \(\widetilde{\eta }_i=\varDelta _i\eta _i\). For the sake of clarity, we further assume that there are no ties in the sample \(\big \{\widetilde{T}_i, i=1,\ldots ,n\big \}\). Therefore, without loss of generality, for the formulas below we assume that we observe \(0< \widetilde{T}_1< \cdots < \widetilde{T}_n\). In particular, this implies \(n_i=\sum _{j=1}^n {\mathbb {1}}\{ \widetilde{T}_j \ge \widetilde{T}_i \}=n-(i-1)\) for all \(i=1,\ldots ,n\).
The constrained Aalen–Johansen estimates \(\widehat{F}_{k}^{(1)}(u)\) of the cumulative incidence functions of event \(k=1,2\) at time u, that is \(F_{k}(u)=\mathbb {P}(T \le u, \eta =k)\), with constraint \(\widehat{F}_{1}^{(1)}(t)=p\), for a given \(p \in ]0,1[\), are defined as
where
and
In the above equations, we define \(\widetilde{T}_{0}=0\), \(\widehat{F}_{1}^{(1)}(0)=\widehat{F}_{2}^{(1)}(0)=0\) and \(\lambda \in \mathbb {R}\) such that \(\widehat{F}_{1}^{(1)}(t)=p\). The superscript \(^{(1)}\) indicates that the constraint relates to the cumulative incidence function of event 1. Following ideas similar to those used by Thomas and Grunkemeier (1975) to derive the formulas of “Appendix A”, these formulas were derived by maximizing the non-parametric likelihood \(L= \prod _{i=1}^n a_{1,i}^{{\mathbb {1}}\{ \widetilde{\eta }_i=1 \}} a_{2,i}^{{\mathbb {1}}\{ \widetilde{\eta }_i=2 \}} \big ( 1 - a_{1,i} - a_{2,i} \big )^{n-i}\) under the constraint \(\widehat{F}_{1}^{(1)}(t)=p\), using the Lagrange multiplier technique. Extensions of the above formulas can also be derived to account for ties in \(\big \{\widetilde{T}_i, i=1,\ldots ,n\big \}\) (Blanche 2017). The usual Aalen–Johansen estimator corresponds to the above formulas in the special case where \(\lambda =0\).
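To make the role of the likelihood concrete, the following sketch covers the unconstrained case (\(\lambda =0\)) only; the constrained formulas are not reproduced. It assumes that \(a_{k,i}\) denotes the discrete cause-specific hazard of event k at \(\widetilde{T}_i\), an interpretation adopted here for illustration. Under that assumption each factor of L is maximized at \(a_{k,i}={\mathbb {1}}\{ \widetilde{\eta }_i=k \}/(n-i+1)\), and accumulating these hazards gives the usual Aalen–Johansen estimate. Function names and toy data are hypothetical.

```python
import math

def unconstrained_hazards(etas):
    """Maximizers of L without constraint, for untied data sorted by increasing time.
    With 0-based index i, n - i subjects are at risk; censored subjects (eta = 0) contribute 0."""
    n = len(etas)
    return [(float(eta == 1) / (n - i), float(eta == 2) / (n - i)) for i, eta in enumerate(etas)]

def log_likelihood(etas, hazards):
    """log L = sum_i [1{eta_i=1} log a_1i + 1{eta_i=2} log a_2i + (n - i) log(1 - a_1i - a_2i)]."""
    n = len(etas)
    total = 0.0
    for i, (eta, (a1, a2)) in enumerate(zip(etas, hazards), start=1):
        if eta == 1:
            total += math.log(a1)
        elif eta == 2:
            total += math.log(a2)
        if n - i > 0:  # the last subject has exponent 0 and contributes nothing here
            total += (n - i) * math.log(1.0 - a1 - a2)
    return total

def aalen_johansen(times, etas, u):
    """Cumulative incidences (F1(u), F2(u)) built from the unconstrained hazards."""
    surv, f1, f2 = 1.0, 0.0, 0.0
    for time, (a1, a2) in zip(times, unconstrained_hazards(etas)):
        if time > u:
            break
        f1 += surv * a1  # surv is the event-free probability just before this time
        f2 += surv * a2
        surv *= 1.0 - a1 - a2
    return f1, f2

# Toy data (hypothetical): sorted untied times; eta = 0 censored, 1 or 2 the observed event type.
times = [1, 2, 3, 4, 5, 6]
etas = [1, 0, 2, 1, 0, 2]
f1, f2 = aalen_johansen(times, etas, u=4)
```

Perturbing any of the hazards returned by `unconstrained_hazards` lowers `log_likelihood`, which is one way to check numerically that the unconstrained maximizer has the stated form.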
About this article
Cite this article
Blanche, P., Gerds, T.A. & Ekstrøm, C.T. The Wally plot approach to assess the calibration of clinical prediction models. Lifetime Data Anal 25, 150–167 (2019). https://doi.org/10.1007/s10985-017-9414-3
DOI: https://doi.org/10.1007/s10985-017-9414-3