Data Mining and Knowledge Discovery

Volume 30, Issue 4, pp 763–796

Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC

  • Roger M. Stein


There has been a growing recognition that issues of data quality, which are routine in practice, can materially affect the assessment of learned model performance. In this paper, we develop some analytic results that are useful in sizing the biases associated with tests of discriminatory model power when these are performed using corrupt (“noisy”) data. As it is sometimes unavoidable to test models with data that are known to be corrupt, we also provide some guidance on interpreting the results of such tests. In some cases, with appropriate knowledge of the corruption mechanism, the true values of performance statistics such as the area under the ROC curve may be recovered (in expectation), even when the underlying data have been corrupted. We also provide estimators of the standard errors of such recovered performance statistics. An analysis of the estimators reveals interesting behavior, including the observation that “noisy” data do not “cancel out” across models even when the same corrupt data set is used to test multiple candidate models. Because our results are analytic, they may be applied in a broad range of settings without the need for simulation.
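The idea of recovering the area under the ROC curve (AUC) from corrupted evaluation data can be illustrated with a minimal sketch. This is not the estimator developed in the paper. It assumes one particular, simple corruption mechanism: labels are flipped at random, independently of the model score given the true class, and the "purity" of each observed class (the share of observed positives that are truly positive, and likewise for negatives) is known. Under that assumption the expected observed AUC equals 0.5 + (AUC − 0.5)(a + b − 1), where a and b are the two purities, so the relation can be inverted. The function names, parameters, and flip rates below are all hypothetical choices made for illustration.

```python
import random


def auc(scores, labels):
    """Mann-Whitney estimate of P(score of a 1 > score of a 0); ties count 1/2."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores and give each member the average rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def recover_auc(observed_auc, purity_pos, purity_neg):
    """Invert E[observed] = 0.5 + (AUC - 0.5) * (purity_pos + purity_neg - 1).

    Assumption (not from the paper): flips are independent of the score
    given the true class, and both purities are known.
    """
    return 0.5 + (observed_auc - 0.5) / (purity_pos + purity_neg - 1.0)


def simulate(seed=0, n=20000, prevalence=0.3, miss_rate=0.2, false_alarm_rate=0.1):
    """Return (clean AUC, noisy AUC, recovered AUC) for one simulated data set."""
    rng = random.Random(seed)
    truth = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    # Hypothetical model: scores are Gaussian and shifted upward for true positives.
    scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in truth]
    # Corrupt the labels: true 1s are recorded as 0 with miss_rate,
    # true 0s are recorded as 1 with false_alarm_rate.
    noisy = []
    for y in truth:
        flip = rng.random() < (miss_rate if y else false_alarm_rate)
        noisy.append(1 - y if flip else y)
    # Purities are computed from the known truth here; in practice they
    # would have to come from outside knowledge of the corruption mechanism.
    n_obs_pos = sum(noisy)
    purity_pos = sum(1 for t, o in zip(truth, noisy) if t == 1 and o == 1) / n_obs_pos
    purity_neg = sum(1 for t, o in zip(truth, noisy) if t == 0 and o == 0) / (n - n_obs_pos)
    clean_auc = auc(scores, truth)
    noisy_auc = auc(scores, noisy)
    return clean_auc, noisy_auc, recover_auc(noisy_auc, purity_pos, purity_neg)


if __name__ == "__main__":
    clean, noisy, recovered = simulate()
    print("clean AUC:", round(clean, 3))
    print("noisy AUC:", round(noisy, 3))
    print("recovered AUC:", round(recovered, 3))
```

In this sketch the attenuation is proportional to the model's distance from 0.5, so two models with different true AUCs lose different amounts when scored on the same noisy labels. The recovery step also divides by a + b − 1, which amplifies sampling error as the labels become noisier.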


ROC, Model validation, Prediction, Data corruption, Bias correction, Misclassification, Credit models, Machine learning

Mathematics Subject Classification

62-07 62G10 



The author is grateful to Sanjiv Das, David Fagnan, Lisa Goldberg and Mitchell Petersen for detailed comments on earlier drafts of this paper. The author is particularly grateful to Foster Provost, who provided extensive and detailed suggestions on improving the exposition and extending the results, including suggesting the idea of a recovered ROC. This article was greatly improved by the observations and suggestions of three anonymous reviewers. All errors are, of course, my own. The views expressed in this article are those of the author and do not represent the views of former employers or any of their affiliates.



Copyright information

© The Author(s) 2015

Authors and Affiliations

  1. MIT Laboratory for Financial Engineering, Cambridge, USA
