Fieller Stability Measure: a novel model-dependent backtesting approach

  • General Paper
Journal of the Operational Research Society

Abstract

Dataset shift is present in almost all real-world applications, since most of them operate in constantly changing environments. Detecting fractures in datasets in time allows models to be recalibrated before a significant decrease in performance is observed. Since small changes are normal in most applications and do not justify the effort that recalibration requires, we are interested only in identifying changes that are critical for the correct functioning of the model. In this work we propose a model-dependent backtesting strategy designed to identify significant changes in the covariates, relating a confidence zone of the change to a maximal deviance measure obtained from the coefficients of the model. Using logistic regression as the predictive approach, we performed experiments on simulated data and on a real-world credit scoring dataset. The results show that the proposed method outperforms traditional approaches, consistently identifying major changes in variables while taking into account important characteristics of the problem, such as sample sizes, variances, and uncertainty in the coefficients.
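The abstract does not spell out how the confidence zone of the change is built. As an illustration only, and not the authors' procedure, the following minimal sketch computes the classical Fieller-type confidence interval for a ratio of two independent estimates, which is the kind of construction the method's name suggests. The function name `fieller_interval`, its parameters, the independence assumption, and the normal critical value `z = 1.96` are all assumptions introduced here.

```python
import math


def fieller_interval(a, b, var_a, var_b, z=1.96):
    """Fieller-type confidence interval for the ratio a / b.

    a, b         : independent point estimates (for example, two sample means)
    var_a, var_b : variances of those estimates (squared standard errors)
    z            : normal critical value; 1.96 gives roughly 95% coverage

    Returns (lower, upper), or None when the interval is unbounded because
    the denominator b is not significantly different from zero.
    """
    # The limits are the roots r of (a - r*b)^2 = z^2 * (var_a + r^2 * var_b),
    # i.e. of the quadratic qa*r^2 - 2*qb*r + qc = 0.
    qa = b * b - z * z * var_b
    qb = a * b
    qc = a * a - z * z * var_a
    if qa <= 0:
        return None  # unbounded confidence set; no finite interval to report
    # With qa > 0 the discriminant is non-negative; max() guards rounding error.
    root = math.sqrt(max(qb * qb - qa * qc, 0.0))
    return ((qb - root) / qa, (qb + root) / qa)


# Hypothetical numbers: mean of a covariate in a new sample (a) against its
# mean in the training sample (b), each with the variance of the mean.
interval = fieller_interval(a=10.4, b=10.0, var_a=0.02, var_b=0.01)
if interval is not None:
    lower, upper = interval
    print(f"ratio interval: [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions, a ratio interval that excludes 1 would indicate that the covariate mean has changed; how large a change matters for the model is the part the paper ties to the model coefficients, which this sketch does not attempt to reproduce.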

Acknowledgements

The first author acknowledges the support of CONICYT Becas Chile PD-74140041. The second author was supported by CONICYT FONDECYT Initiation into Research 11121196. Both authors acknowledge the support of the Institute of Complex Engineering Systems (ICM: P-05-004-F, CONICYT: FBO16).

Author information

Corresponding author

Correspondence to Cristián Bravo.

Appendix

Results for the experiments on real data

The following graphs present the results of the dataset shift tests (the proposed approach and SI) for the six remaining variables (Variables 4–9).

Figure A1


(a) Relevant shift (Var4).

(b) Slight shift (Var5).

(c) Slight shift (Var6).

(d) Slight shift (Var7).

Figure A2


(a) No shift (Var8).

(b) Severe shift (Var9).

About this article

Cite this article

Bravo, C., Maldonado, S. Fieller Stability Measure: a novel model-dependent backtesting approach. J Oper Res Soc 66, 1895–1905 (2015). https://doi.org/10.1057/jors.2015.18

  • DOI: https://doi.org/10.1057/jors.2015.18
