# Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC

## Abstract

There has been growing recognition that issues of data quality, which are routine in practice, can materially affect the assessment of learned model performance. In this paper, we develop analytic results that are useful in sizing the biases that arise when tests of a model's discriminatory power are performed using corrupted ("noisy") data. Because it is sometimes unavoidable to test models with data that are known to be corrupted, we also provide guidance on interpreting the results of such tests. In some cases, with appropriate knowledge of the corruption mechanism, the true values of performance statistics such as the area under the ROC curve may be recovered (in expectation), even when the underlying data have been corrupted. We also provide estimators of the standard errors of such recovered performance statistics. An analysis of these estimators reveals interesting behavior, including the observation that "noisy" data do not "cancel out" across models, even when the same corrupted data set is used to test multiple candidate models. Because our results are analytic, they may be applied in a broad range of settings without the need for simulation.
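The recovery idea sketched in the abstract can be illustrated with a simple mixture argument. Assuming labels are flipped independently with known rates `alpha` (true positives mislabeled as negatives) and `beta` (true negatives mislabeled as positives), the expected observed AUC is a linear function of the true AUC and can therefore be inverted. The sketch below is illustrative only; the names, the specific flip model, and the simulation parameters are assumptions for demonstration, not the paper's exact estimator.

```python
import numpy as np

def auc(scores, labels):
    """Mann-Whitney estimate of P(score_pos > score_neg); assumes continuous scores (no ties)."""
    n1 = int(labels.sum())
    n0 = len(labels) - n1
    ranks = scores.argsort().argsort() + 1  # 1-based ranks of all scores
    return (ranks[labels == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def recovered_auc(auc_obs, pi, alpha, beta):
    """Invert the expected observed AUC under independent label flips.

    pi is the true positive prevalence; alpha/beta are the known flip rates.
    Weights come from the four (observed-positive, observed-negative) pair types:
    (true+, true-) contributes A, same-class pairs contribute 1/2,
    (true-, true+) contributes 1 - A.
    """
    w_pp = pi * (1 - alpha) * (1 - pi) * (1 - beta)
    w_np = (1 - pi) * beta * pi * alpha
    w_same = pi * (1 - alpha) * pi * alpha + (1 - pi) * beta * (1 - pi) * (1 - beta)
    denom = (pi * (1 - alpha) + (1 - pi) * beta) * (pi * alpha + (1 - pi) * (1 - beta))
    return (auc_obs * denom - 0.5 * w_same - w_np) / (w_pp - w_np)

# Simulation: Gaussian scores, then corrupt the labels with known flip rates.
rng = np.random.default_rng(0)
n1, n0 = 5000, 15000
pi = n1 / (n1 + n0)
scores = np.concatenate([rng.normal(1.0, 1.0, n1), rng.normal(0.0, 1.0, n0)])
y_true = np.concatenate([np.ones(n1, dtype=int), np.zeros(n0, dtype=int)])

alpha, beta = 0.10, 0.05
u = rng.random(n1 + n0)
y_obs = np.where(y_true == 1, (u >= alpha).astype(int), (u < beta).astype(int))

auc_clean = auc(scores, y_true)   # AUC against the uncorrupted labels
auc_noisy = auc(scores, y_obs)    # biased AUC computed from corrupted labels
auc_rec = recovered_auc(auc_noisy, pi, alpha, beta)
```

Under this flip model the noisy AUC is biased toward 0.5, and `auc_rec` undoes that bias in expectation; the inversion also amplifies sampling error slightly, consistent with the abstract's point that noise does not simply cancel out.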

## Keywords

ROC · Model validation · Prediction · Data corruption · Bias correction · Misclassification · Credit models · Machine learning

## Mathematics Subject Classification

62-07 · 62G10

## Notes

### Acknowledgments

The author is grateful to Sanjiv Das, David Fagnan, Lisa Goldberg and Mitchell Petersen for detailed comments on earlier drafts of this paper. The author is particularly grateful to Foster Provost, who provided extensive and detailed suggestions on improving the exposition and extending the results, including suggesting the idea of a recovered ROC. This article was greatly improved by the observations and suggestions of three anonymous reviewers. All errors are, of course, my own. The views expressed in this article are those of the author and do not represent the views of former employers or any of their affiliates.
