Lifetime Data Analysis

Volume 19, Issue 2, pp 202–218

Understanding increments in model performance metrics

  • Michael J. Pencina
  • Ralph B. D’Agostino
  • Joseph M. Massaro

Abstract

The area under the receiver operating characteristic curve (AUC) is the most commonly reported measure of discrimination for prediction models with binary outcomes. However, recently it has been criticized for its inability to increase when important risk factors are added to a baseline model with good discrimination. This has led to the claim that the reliance on the AUC as a measure of discrimination may miss important improvements in clinical performance of risk prediction rules derived from a baseline model. In this paper we investigate this claim by relating the AUC to measures of clinical performance based on sensitivity and specificity under the assumption of multivariate normality. The behavior of the AUC is contrasted with that of discrimination slope. We show that unless rules with very good specificity are desired, the change in the AUC does an adequate job as a predictor of the change in measures of clinical performance. However, stronger or more numerous predictors are needed to achieve the same increment in the AUC for baseline models with good versus poor discrimination. When excellent specificity is desired, our results suggest that the discrimination slope might be a better measure of model improvement than AUC. The theoretical results are illustrated using a Framingham Heart Study example of a model for predicting the 10-year incidence of atrial fibrillation.
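The following Python sketch (not part of the original article) illustrates, under a simple binormal assumption with a hypothetical standardized mean difference delta and arbitrary logistic parameters alpha and beta, the quantities contrasted in the abstract: the closed-form AUC Φ(δ/√2), an empirical AUC, the discrimination slope, and the Youden index.

```python
# A minimal numerical sketch, assuming a single binormal risk score;
# delta, alpha, and beta are hypothetical values chosen for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Standardized risk score whose mean differs between events and
# non-events by delta (a Mahalanobis-type distance).
delta, n = 1.0, 100_000
score_events = rng.normal(delta, 1.0, n)
score_nonevents = rng.normal(0.0, 1.0, n)

# Closed-form binormal AUC: Phi(delta / sqrt(2)).
auc_theory = norm.cdf(delta / np.sqrt(2))

# Empirical AUC as P(score_event > score_nonevent) (Mann-Whitney form),
# estimated by pairing events with randomly permuted non-events.
auc_empirical = np.mean(score_events > rng.permutation(score_nonevents))

# Discrimination slope: mean predicted risk in events minus non-events,
# using a hypothetical logistic link with intercept alpha and slope beta.
alpha, beta = -2.0, 1.0
p_events = 1 / (1 + np.exp(-(alpha + beta * score_events)))
p_nonevents = 1 / (1 + np.exp(-(alpha + beta * score_nonevents)))
disc_slope = p_events.mean() - p_nonevents.mean()

# Youden index: max over thresholds of sensitivity + specificity - 1;
# under the equal-variance binormal model it is attained at delta / 2.
thresholds = np.linspace(-3, 4, 701)
sens = np.array([(score_events > t).mean() for t in thresholds])
spec = np.array([(score_nonevents <= t).mean() for t in thresholds])
youden = (sens + spec - 1).max()

print(f"AUC (theory) = {auc_theory:.3f}, AUC (empirical) = {auc_empirical:.3f}")
print(f"Discrimination slope = {disc_slope:.3f}, Youden index = {youden:.3f}")
```

Repeating the calculation for a baseline and an enriched model and differencing the two discrimination slopes gives the integrated discrimination improvement (IDI) referenced in the keywords, while the difference in AUCs gives the increment whose behavior the paper examines.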

Keywords

Risk prediction · Discrimination · AUC · IDI · Youden index · Relative utility


Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Michael J. Pencina (1)
  • Ralph B. D’Agostino (2)
  • Joseph M. Massaro (1)

  1. Department of Biostatistics, Harvard Clinical Research Institute, Boston University, Boston, USA
  2. Department of Mathematics and Statistics, Harvard Clinical Research Institute, Boston University, Boston, USA
