## Abstract

There has been a growing recognition that issues of data quality, which are routine in practice, can materially affect the assessment of learned model performance. In this paper, we develop some analytic results that are useful in sizing the biases associated with tests of discriminatory model power when these are performed using corrupt (“noisy”) data. As it is sometimes unavoidable to test models with data that are known to be corrupt, we also provide some guidance on interpreting results of such tests. In some cases, with appropriate knowledge of the corruption mechanism, the true values of the performance statistics such as the area under the ROC curve may be recovered (in expectation), even when the underlying data have been corrupted. We also provide estimators of the standard errors of such recovered performance statistics. An analysis of the estimators reveals interesting behavior including the observation that “noisy” data does not “cancel out” across models even when the same corrupt data set is used to test multiple candidate models. Because our results are analytic, they may be applied in a broad range of settings and this can be done without the need for simulation.

## Notes

This is not universally the case. For example, Chap. 7 of Bohn and Stein (2009) deals extensively with model evaluation and data issues are also discussed in Chaps. 3, 4, 6 and 9.

Because our focus is on evaluation rather than estimation (learning) we assume that a model to be evaluated has already been estimated and that the task at hand is to evaluate this model using the available data.

This assumption represents a key area for future research. See the discussion in Sect. 4.3.

Note that depending on the circumstances, this may violate the CAR assumption. See Sect. 4.3.

Note that the AUC is a summary statistic. As such, there are many cases in which important information is lost in summarization. In particular, if the ROC curves for two models cross, the AUC may not provide information on the best model for a specific application. In fact, depending on the specific application, it may be possible to arrive at a higher AUC using the combined models than either can achieve on its own (Provost and Fawcett 2001).

For a random model, \(E[A]=0.5\), but \(\widehat{A}_0\) will almost surely differ from 0.5 due to sampling error. While it is sometimes convenient to drop the expectation notation for \(\widehat{A}_0\), \(\widehat{A}_0\) cannot be calculated practically without knowledge of the specific data corruption process, so we can *only* work with expectations. The same is true of estimates of \(\widehat{A}\), the true, unobserved, value of *A*. This introduces practical issues in calculating higher moments such as covariance. (See Sect. 2.6).

Note that we show these results in percentage terms (e.g., a percentage of positive observations mislabeled). We do this for convenience; however, the results are directionally similar regardless of whether we measure in percentage or absolute terms. For example, assume that \(A=0.8\), \(n=10{,}000\), and \(m=2000\). First examine the case of constant mislabelings if we keep the number of mislabelings fixed at 200. In this case, by (5), mislabeling “bads” (\(k=200\), \(l=0\)) results in \({A}_{c}=0.794\); in contrast, mislabeling the same number of “goods” (\(k=0\), \(l=200\)) produces a greater degradation, \({A}_{c}=0.773\). Now we repeat, but keep the percentage of mislabelings constant at 10 %. In this case, mislabeling “bads” (\(k=200\), \(l=0\)) again results in \({A}_{c}=0.794\); mislabeling the same percentage of “goods” (\(k=0\), \(l=1000\)) again yields a greater degradation, \({A}_{c}=0.7\).
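Under the CAR assumption, the one-sided expectations quoted in this footnote can be reproduced by a count-weighted mixture argument: a mislabeled record is compared against members of its own true class (expected pairwise contribution 0.5) rather than against the other class (contribution *A*). The sketch below illustrates that calculation; it is not a restatement of Eq. (5):

```python
def auc_mislabel_bads(A, n, k):
    # k "bads" relabeled as "good": the n true goods still contribute A per
    # pairwise comparison; the k mislabeled bads contribute 0.5 against the
    # remaining true bads. The corrupted AUC is the count-weighted mixture.
    return (n * A + 0.5 * k) / (n + k)

def auc_mislabel_goods(A, m, l):
    # l "goods" relabeled as "bad", by the symmetric argument.
    return (m * A + 0.5 * l) / (m + l)

A, n, m = 0.8, 10_000, 2_000
print(round(auc_mislabel_bads(A, n, 200), 3))    # 0.794
print(round(auc_mislabel_goods(A, m, 200), 3))   # 0.773
print(round(auc_mislabel_goods(A, m, 1000), 3))  # 0.7
```

Mislabeling “goods” degrades the statistic more because the “bad” class is smaller (\(m < n\)), so each contaminating record carries more relative weight.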

Note that the variances of the two estimates of \(\widehat{A}_0\), corresponding to *l* and *k* mislabelings of *n* and *m* respectively, will be different. See footnote 9.

When \(E[{\widehat{{A}}}_{0}]=0.5\), it can be shown that a reasonable analytic approximation is given by \(\text {Var}(\widehat{A}_{0})=\text {Var}({A}_{0}^{mk})+\text {Var}({A}_{0}^{nl}){\approx }\frac{m+1}{12(m-k)k}+\frac{n+1}{12(n-l)l}\). See Bamber (1975), Eq. (4) for a general form.
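The approximation in the footnote above is simple to evaluate directly. A minimal sketch (the sample and mislabeling counts below are illustrative, and the formula requires \(0<k<m\) and \(0<l<n\)):

```python
def var_A0_approx(n, m, k, l):
    # Analytic approximation to Var(A_0-hat) when E[A_0-hat] = 0.5;
    # see Bamber (1975), Eq. (4), for the general form.
    return (m + 1) / (12 * (m - k) * k) + (n + 1) / (12 * (n - l) * l)

# Illustrative: n = 10,000 goods, m = 2,000 bads, k = l = 200 mislabelings
v = var_A0_approx(10_000, 2_000, 200, 200)
print(v)  # ≈ 8.88e-4, i.e., a standard error of roughly 0.03
```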

For example, assume a credit model produces a high score if it predicts that a firm is likely to default and a low score if it predicts the firm will not default. Intuitively, if the set of *k* corrupted default records *happens to* include a large proportion of records with higher model scores (relative to the mean of the defaulted records), \(\widehat{A}_{c}\) will be decreased (because the mean score of the “default” records is decreased and the mean score of the “non-default” records is increased, so on average the separation between the two classes is reduced). In this case, \(\widehat{A}_{0}=\widehat{A}_0^{mk}\) will also decrease (there will now be more separation on average between the set of (high score) false “non-default” records and the set of (low score) true “default” records, but the relationship will be inverted since the false “non-default” records have higher scores on average than the true “default” records to which they are compared; see the definition of \(\widehat{A}_0^{mk}\)). Similarly, if the set of *k* corrupted default records *happens to* include a high proportion of very low model scores, \(\widehat{A}_{c}\) should increase (since the mean score of the true “default” records is increased more than is the mean score of the set of all records labeled “non-default”), while \(\widehat{A}_{0}\) will also increase (since there will now be more separation on average between the false “non-default” records and the true “default” records, but this time in the correct direction). The degree to which this positive correlation is expressed will depend on the true value of *A*, which measures the model’s ability to discriminate between the two classes, the distributions of scores in each class, and the values of *m*, *n*, *k*, and *l*.

We estimate the “true” value of *A* as the mean value of \(\widehat{A}\) across all realizations in each simulation.

We bootstrap each simulated corrupted data set *B* times. Thus, for \(N_S\) simulation paths, we perform a total of \(N_S\times {B}\) calculations of the AUC. In Table 3 we use \(B=20{,}000\) and \(N_S=500\), so the total number of AUC calculations is 10 million per simulation (row). We parallelize the bootstrap to permit multi-threaded evaluation.

## References

Bamber D (1975) The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 12(4):387–415

Bohn JR, Stein RM (2009) Active credit portfolio management in practice. Wiley, Hoboken

DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837–845

Dwyer DW, Stein RM (2006) Inferring the default rate in a population by comparing two incomplete default databases. J Bank Financ 30(3):797–810

Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the international joint conference on artificial intelligence (IJCAI’01), pp 973–978

Engelmann B, Hayden E, Tasche D (2003) Testing rating accuracy. Risk 16:82–86

Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186. doi:10.1023/A:1010920819831

Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

Hausman JA, Abrevaya J, Scott-Morton FM (1998) Misclassification of the dependent variable in a discrete-response setting. J Econ 87(2):239–269

Heitjan DF, Rubin DB (1991) Ignorability and coarse data. Ann Stat 19:2244–2253. doi:10.1214/aos/1176348396

Hoeffding W (1948) A class of statistics with asymptotically normal distribution. Ann Math Stat 19(3):293–325. doi:10.2307/2235637

Macskassy SA, Provost F, Rosset S (2005) ROC confidence bands: an empirical evaluation. In: Proceedings of the 22nd international conference on machine learning (ICML), Bonn, Germany, pp 537–544

Peterson WW, Birdsall TG, Fox WC (1954) The theory of signal detectability. Trans IRE Prof Group Inf Theory 2–4:171–212

Provost F, Fawcett T (2001) Robust classification for imprecise environments. Mach Learn 42(2):203–231

Russell H, Tang QK, Dwyer DW (2012) The effect of imperfect data on default prediction validation tests. J Risk Model Valid 6(1):1–20

Sobehart J, Keenan S, Stein R (2000) Validation methodologies for default risk models. Credit 16:51–56

Stein RM (2007) Benchmarking default prediction models: pitfalls and remedies in model validation. J Risk Model Valid 1(1):77–113

Stein RM, Kocagil AE, Bohn J, Akhavain J (2003) Systematic and idiosyncratic risk in middle-market default prediction: a study of the performance of the RiskCalc and PFM Models. MKMV Special Comment

York D (1969) Least squares fitting of a straight line with correlated errors. Earth Planet Sci Lett 5:320–324

## Acknowledgments

The author is grateful to Sanjiv Das, David Fagnan, Lisa Goldberg and Mitchell Petersen for detailed comments on earlier drafts of this paper. The author is particularly grateful to Foster Provost, who provided extensive and detailed suggestions on improving the exposition and extending the results, including suggesting the idea of a recovered ROC. This article was greatly improved by the observations and suggestions of three anonymous reviewers. All errors are, of course, my own. The views expressed in this article are those of the author and do not represent the views of former employers or any of their affiliates.


## Additional information

Responsible editor: Johannes Fürnkranz.

## Appendices

### Appendix A: Some comments on data corruption in the development sample

Although beyond the scope of this paper, we note that the case of data corruption of the development sample represents a more involved problem as it can potentially affect the model parameters themselves. Fortunately, a number of aspects of this problem have been studied extensively in the statistics and econometrics literature, so in many cases, the effects of such mislabeling on model estimation are well understood and fix-ups have been developed for many common problems.

For example, it is well known that, in general, random noise added to the *independent* variables in regression problems will bias the estimates of the model coefficients toward zero, since the sum of squares of the independent variables (\(\mathbf {X'X}\)) will increase faster than the cross-product of the independent and dependent variables (\(\mathbf {X'y}\)), implying that the ratio of the two will decline. This type of bias is often termed regression dilution.
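Regression dilution is easy to see in simulation. The sketch below (the noise scales and sample size are arbitrary choices) corrupts the regressor with independent noise of the same variance as the regressor, which attenuates the OLS slope by roughly the factor \(\sigma ^2_x/(\sigma ^2_x+\sigma ^2_u)=1/2\):

```python
import random

random.seed(0)
n, beta = 100_000, 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 0.5) for xi in x]
x_noisy = [xi + random.gauss(0, 1) for xi in x]  # corrupt the regressor

def ols_slope(xs, ys):
    # Slope of a simple OLS fit computed from centered moments
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

slope_clean = ols_slope(x, y)        # close to beta = 2.0
slope_noisy = ols_slope(x_noisy, y)  # attenuated toward ~1.0
```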

For standard regression, corruption of the *dependent* variable will not bias the estimates of the model coefficients; however, it will reduce the efficiency of the estimation by adding additional error to the right-hand side. Unbiasedness does not hold, though, when the dependent variable is limited in certain ways, as it is in the case of a logit or probit specification where \({y}_{i}{\in }\{0,1\}\). In such cases, the coefficients estimated via maximum likelihood will be biased.

As discussed in Sect. 4.1, there are known methods for correcting for this bias by simultaneously estimating both the model coefficients and the rate of mislabeling in the sample (cf., Hausman et al. 1998).
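The bias from mislabeling the dependent variable in a logit can likewise be demonstrated by simulation. In the sketch below (the sample size, flip rate, and plain gradient-ascent fitter are illustrative choices, and the intercept is omitted for brevity), flipping 10 % of labels at random shrinks the fitted coefficient toward zero:

```python
import math
import random

random.seed(1)
n, beta = 20_000, 1.5
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < 1 / (1 + math.exp(-beta * xi)) else 0 for xi in x]
y_noisy = [1 - yi if random.random() < 0.10 else yi for yi in y]  # MCAR flips

def fit_logit_slope(xs, ys, iters=200, lr=0.5):
    # Gradient ascent on the logistic log-likelihood (slope only, no intercept)
    b = 0.0
    for _ in range(iters):
        grad = sum((yi - 1 / (1 + math.exp(-b * xi))) * xi
                   for xi, yi in zip(xs, ys))
        b += lr * grad / len(xs)
    return b

b_clean = fit_logit_slope(x, y)        # near the true value of 1.5
b_noisy = fit_logit_slope(x, y_noisy)  # biased toward zero
```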

Finally, if there are “bad” records missing from the development sample, the situation is that of a sample base rate that differs from (and is lower than) the true base rate. In this case, estimation is still feasible, but the predicted probabilities will be biased downward due to the “lower” prevalence in the estimation data. Elkan (2001) provides a proof that when the base rates of the development sample and the evaluation sample differ, the observed probability \({p}_{i}^{*}\), for the \(i\mathrm{th}\) observation in the evaluation sample, is calculated as:

$$p_{i}^{*} = \frac{\pi _{T}\,p_{i}(1-\pi _{S})}{\pi _{S}(1-p_{i}) + \pi _{T}(p_{i}-\pi _{S})}$$

where \({p}_{i}\) is the raw probability produced by the model, \({\pi }_{S}\) is the baseline prevalence rate in the development sample, and \({\pi }_{T}\) is the baseline prevalence in the evaluation sample. (Note that we adopt the notation used in Bohn and Stein (2009) in preference to that of Elkan (2001).)

Bohn and Stein (2009) provides a discussion of this approach as well as examples.
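A sketch of the base-rate correction as a function (the function name is ours; the formula follows Elkan (2001), written in the Bohn and Stein (2009) notation used above):

```python
def adjust_probability(p_i, pi_S, pi_T):
    # Map a raw model probability p_i, produced under development-sample
    # base rate pi_S, to the evaluation-sample base rate pi_T (Elkan 2001).
    num = pi_T * p_i * (1 - pi_S)
    den = pi_S * (1 - p_i) + pi_T * (p_i - pi_S)
    return num / den

# When the base rates agree, the correction is the identity:
print(round(adjust_probability(0.3, 0.2, 0.2), 6))  # 0.3
# A lower evaluation base rate pulls the probability down:
print(round(adjust_probability(0.3, 0.2, 0.1), 6))  # 0.16
```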

Note that the observations in most of this Appendix are premised on the data corruption being CAR or MCAR. If neither of these assumptions holds, alternative approaches are available in some cases, but the estimators typically become much more involved. This problem area has also been studied extensively (see, e.g., York 1969 for an early reference).

### Appendix B: Recovering an approximate ROC curve (under parametric assumptions)

Some readers may find it useful to construct ROC curves in addition to calculating AUC statistics. It is feasible to generate representative ROC curves that are, under certain assumptions, consistent with the recovered values of the AUC, \(\widehat{A_r}\). Because, in general, we do not know which records are corrupted (if we did, we would correct them), we cannot easily determine the recovered values of \((\widehat{FN}_r, \widehat{TP}_r)\), at individual points on the ROC. However, subject to assumptions about the distribution of the model scores, we can generate a parametric ROC.

If we make the assumption that *G* and *B* are distributed normally with means \(\mu _G, \mu _B\) and variances \(\sigma ^2_G, \sigma ^2_B\), respectively, it can then be shown that

$$ROC(u) = \varPhi \left( a + b\,\varPhi ^{-1}(u)\right)$$ (15)

where \(a=(\mu _B- \mu _G)/\sigma _B\), \(b=\sigma _G/\sigma _B\), and \({\varPhi }(\cdot )\) and \({\varPhi }^{-1}(\cdot )\) are the cumulative and inverse cumulative normal distribution functions, respectively. It then follows that

$$A = \varPhi \left( \frac{a}{\sqrt{1+b^2}}\right)$$ (16)

Finally, we can solve for either *a* or *b* in terms of the other (and *A*):

$$a = \varPhi ^{-1}(A)\sqrt{1+b^2}$$ (17)

and

$$b = \sqrt{\left( \frac{a}{\varPhi ^{-1}(A)}\right) ^2 - 1}$$ (18)

We *do not* know which records are corrupted so there is little guidance on how to adjust the estimates of the \(\mu \) and \(\sigma ^2\) parameters. Though there is no theoretical motivation for them, we present two heuristic approaches for plotting the recovered ROC curve. Both focus on estimating *a* based on assumptions about *b*.

The first is the simpler of the two and assumes that the variances of *G* and *B* do not change as a result of the corruption. Thus, we would plug in the estimates of \(\sigma ^2_G\) and \(\sigma ^2_B\), from the corrupted data.

The second, and more restrictive, assumes that the variances are equal, and thus that \(b=\sigma _G/\sigma _B=1\). Then

$$a = \sqrt{2}\,\varPhi ^{-1}(A)$$ (19)

In either case, the ROC curve may be generated by assuming a value for *b*, solving for *a*, and then calculating Eq. 15 for values of *u* in [0, 1].

### Example 10

(Recovering an estimated ROC from corrupted data using an estimate of \(\widehat{A}_r\)) Consider again the analyst from Example 4 who is evaluating a bankruptcy model using a corrupted dataset of anonymized financial statement data and corresponding default flags. In Example 4 the analyst calculated that \(\widehat{A}_r= 0.814\) (based on the original estimate of \(\widehat{A}_c=0.73\)).

If the analyst wished to produce an estimated ROC curve, and were comfortable making the assumption that the variances of the model scores for the defaulting and non-defaulting firms were equal, he could use Eq. 19 to estimate *a*, and then use Eq. 15 with \(b=1\) to generate the ROC curve.

The resulting curve is shown in Fig. 6. For comparison, the same curve, plotted for the original estimate of \(\widehat{A_c}=0.73\) is shown as a dashed line.
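This construction can be sketched in code. The snippet below (a pure-Python \(\varPhi \) and a bisection inverse are used to keep it self-contained; the grid size is arbitrary) generates the equal-variance (\(b=1\)) curve for the recovered value \(\widehat{A}_r=0.814\) from Example 10 and checks that the area under the generated curve matches:

```python
import math

def phi(x):
    # Standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p):
    # Inverse normal CDF by bisection (adequate for plotting purposes)
    lo, hi = -8.0, 8.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def recovered_roc(u, A, b=1.0):
    # Parametric ROC point TPR = Phi(a + b * Phi^{-1}(u)) as in Eq. 15,
    # with a solved from A and b via the binormal AUC relationship
    a = phi_inv(A) * math.sqrt(1.0 + b * b)
    return phi(a + b * phi_inv(u))

A_r = 0.814  # recovered AUC from Example 10
us = [i / 1000.0 for i in range(1001)]
tprs = [recovered_roc(u, A_r) for u in us]
# The trapezoidal area under the generated curve recovers A_r
auc = sum(0.5 * (tprs[i] + tprs[i + 1]) * (us[i + 1] - us[i])
          for i in range(len(us) - 1))
```

Plotting `tprs` against `us` (and repeating with \(\widehat{A}_c=0.73\)) reproduces the comparison shown in Fig. 6.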

We note that because of the various assumptions, ROC curves generated in this fashion will likely differ from those that would have been recovered, e.g., nonparametrically, were the corruption mechanism known. However, for some applications, and with caveats about the strong assumptions above, this approach may still provide helpful visual guidance.


## About this article

### Cite this article

Stein, R.M. Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC.
*Data Min Knowl Disc* **30**, 763–796 (2016). https://doi.org/10.1007/s10618-015-0437-7


### Keywords

- ROC
- Model validation
- Prediction
- Data corruption
- Bias correction
- Misclassification
- Credit models
- Machine learning