A time-dependent proportional hazards survival model for credit risk analysis

  • General Paper
Journal of the Operational Research Society

Abstract

In the consumer credit industry, assessment of default risk is critically important for the financial health of both the lender and the borrower. Methods for predicting risk for an applicant using credit bureau and application data, typically based on logistic regression or survival analysis, are universally employed by credit card companies. Because of the manner in which the predictive models are fit using large historical sets of existing customer data that extend over many years, default trends, anomalies, and other temporal phenomena that result from dynamic economic conditions are not brought to light. We introduce a modification of the proportional hazards survival model that includes a time-dependency mechanism for capturing temporal phenomena, and we develop a maximum likelihood algorithm for fitting the model. Using a very large, real data set, we demonstrate that incorporating the time dependency can provide more accurate risk scoring, as well as important insight into dynamic market effects that can inform and enhance related decision making.




Acknowledgements

The authors gratefully acknowledge numerous helpful comments from two anonymous referees and the editor.

Author information

Corresponding author

Correspondence to D W Apley.

Appendices

Appendix A

Some data treatment issues

The data are described in Section 5. Ten predictor variables were chosen for inclusion in the model but, because of confidentiality concerns, we do not describe them in this paper. For numerical reasons, we recommend standardizing all variables before running the MLE algorithm: the scaling of the variables affects the direction in which a gradient-based algorithm searches for the parameter estimates. Missing data are inevitable in large data sets, and we handled them as follows. Each predictor value that was missing for a customer was replaced with a linear regression prediction of that value, based on a covariance matrix calculated from all data that were not missing.
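The imputation and standardization steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, missing values are represented by `None`, and the means and covariance matrix are estimated from complete rows only, which is one possible reading of "all data that were not missing".

```python
import math

def complete_case_moments(rows):
    """Means and covariance matrix from rows with no missing (None) entries.
    Assumption: complete rows only; the paper does not give this detail."""
    full = [r for r in rows if None not in r]
    n, p = len(full), len(full[0])
    means = [sum(r[j] for r in full) / n for j in range(p)]
    cov = [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in full) / (n - 1)
            for k in range(p)] for j in range(p)]
    return means, cov

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def impute(rows):
    """Replace each missing value with its linear regression prediction
    from the values observed in the same row."""
    means, cov = complete_case_moments(rows)
    out = []
    for r in rows:
        obs = [j for j, v in enumerate(r) if v is not None]
        new = list(r)
        for j in [j for j, v in enumerate(r) if v is None]:
            if obs:
                s_oo = [[cov[a][b] for b in obs] for a in obs]
                s_jo = [cov[j][b] for b in obs]
                coef = solve(s_oo, s_jo)  # regression coefficients of variable j on observed ones
                new[j] = means[j] + sum(c * (r[b] - means[b]) for c, b in zip(coef, obs))
            else:
                new[j] = means[j]  # nothing observed in this row: fall back to the mean
        out.append(new)
    return out

def standardize(rows):
    """Centre each column at 0 and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

# Toy data in which the second variable is exactly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [5.0, None]]
filled = impute(rows)
scaled = standardize(filled)
```

In the toy data the missing entry is recovered from the exact linear relation between the two columns; with real data the prediction is only as good as the linear fit.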

Appendix B

Derivation of the TDPH cdf

As mentioned in Section 4, the TDPH cdf of T can be written as

$$F(t; x, \tau; h_0, \psi, \gamma) = 1 - \exp\left(-\int_0^t h_0(u)\,\psi(x)\,\gamma(\tau+u)\,du\right).$$

We break the integral up into components that correspond to the pieces over which $\gamma$ is constant. Let $w(\tau, t) = \lceil(\tau+t)/3\rceil - \lceil\tau/3\rceil + 1$ denote the number of different $\gamma$ pieces between the time of approval ($\tau$) and the time in question ($\tau+t$). The '3' in the denominator arises from the fact that t is in units of months, and the $\gamma$ function is modelled as constant over each quarter. For a customer joining in month $\tau$, the intervals (in terms of the relative month t) over which $\gamma$ is constant can be written as

$$I_{\tau,k} = (s_{\tau,k-1},\, s_{\tau,k}] = \big(3(\lceil\tau/3\rceil + k - 2) - \tau,\; 3(\lceil\tau/3\rceil + k - 1) - \tau\big], \qquad k = 1, 2, \ldots, w(\tau, t).$$

Here $\lceil\cdot\rceil$ denotes the ceiling function, which returns the smallest integer not less than its argument. In the preceding expressions for the intervals, it is understood that the lower boundary of each is truncated at 0 and the upper boundary at t. The constant value of $\gamma(\cdot)$ over the interval $I_{\tau,k}$ is $\gamma_{\lceil\tau/3\rceil+k-1}$. Note that $s_{\tau,0} = 0$ and $s_{\tau,w(\tau,t)} = t$ by definition. The cdf $F(t; x, \tau; h_0, \psi, \gamma)$ of T becomes

$$F(t; x, \tau; h_0, \psi, \gamma) = 1 - \exp\left(-\psi(x) \sum_{k=1}^{w(\tau,t)} \gamma_{\lceil\tau/3\rceil+k-1}\,\big[H_0(s_{\tau,k}) - H_0(s_{\tau,k-1})\big]\right),$$

where $H_0(t) = \int_0^t h_0(u)\,du = -\log(1 - F_0(t))$ is the cumulative hazard function of the baseline distribution.
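As an illustration of the piecewise evaluation in this appendix, the sketch below sums the baseline cumulative hazard over the quarters spanned by an account. It is not the authors' code. The function name, the dictionary `gamma` keyed by quarter index, and the convention that quarter q covers calendar months (3(q−1), 3q] are assumptions made for the example; the baseline cumulative hazard is passed in as a callable so that any baseline distribution can be used.

```python
import math

def tdph_cdf(t, tau, psi, gamma, cum_hazard):
    """TDPH cdf at relative month t for an account approved in calendar month tau.

    psi        -- value of psi(x) for this customer
    gamma      -- dict mapping quarter index to the gamma value of that quarter
    cum_hazard -- callable returning the baseline cumulative hazard H0
    """
    first_q = math.ceil(tau / 3)                     # quarter of approval
    w = math.ceil((tau + t) / 3) - first_q + 1       # number of gamma pieces
    total, lower = 0.0, 0.0
    for k in range(1, w + 1):
        q = first_q + k - 1
        # Upper boundary of piece k in relative months, truncated to [0, t].
        upper = min(t, max(0.0, 3 * q - tau))
        if upper > lower:                            # skip empty pieces
            total += gamma[q] * (cum_hazard(upper) - cum_hazard(lower))
        lower = upper
    return 1.0 - math.exp(-psi * total)

# Hand-checkable example: exponential baseline with H0(t) = 0.1 t.
# Approval in month 2, evaluated 5 months later, so quarters 1 to 3 are touched:
# 1.0 * 0.1 + 2.0 * 0.3 + 0.5 * 0.1 = 0.75.
gamma = {1: 1.0, 2: 2.0, 3: 0.5}
value = tdph_cdf(t=5, tau=2, psi=1.0, gamma=gamma, cum_hazard=lambda u: 0.1 * u)
```

When every gamma value equals 1 the sum telescopes to H0(t), and the function reduces to the ordinary proportional hazards cdf, which is a convenient sanity check.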

Appendix C

The log-likelihood and its gradient for lognormal h₀(t) and exponential ψ(x)

In this section, we derive the log-likelihood function and its gradient (for use in a gradient-based algorithm for maximizing the likelihood) for the special case of a lognormal $h_0(t)$ and an exponential $\psi(x)$. For other baseline distributions or $\psi(x)$ functions, a similar approach can be used. For our special case, the baseline cdf and pdf are $F_0(t) = \Phi((\log t - \mu)/\sigma)$ and $f_0(t) = \phi((\log t - \mu)/\sigma)/(\sigma t)$, where $\mu$ and $\sigma$ are the mean and standard deviation parameters (of $\log T$ under the baseline) and $\Phi$ and $\phi$ denote the standard normal cdf and pdf. In order to avoid identifiability problems due to the confounding between the intercept $\beta_0$ and $\gamma$, we define the exponential function as $\psi(x) = e^{\beta^{\top} x}$, instead of $e^{\beta_0 + \beta^{\top} x}$. For notational simplicity, we denote $F(t; x_i, \tau_i; \mu, \sigma, \beta, \gamma)$ by $F_i(t)$. From Section 4, the log-likelihood function for the entire data set is

where expressions for the $F_i(t)$ are given in Appendix B.
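The lognormal baseline quantities and the reason for dropping the intercept from ψ(x) can be illustrated with a short sketch. The function names and the optional `intercept` argument are hypothetical and exist only to demonstrate the confounding: multiplying γ by a constant has exactly the same effect on the hazard as adding the logarithm of that constant to an intercept, so the two cannot be estimated separately.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, provides cdf() and pdf()

def lognormal_cdf(t, mu, sigma):
    """Baseline cdf F0(t) = Phi((log t - mu) / sigma)."""
    return STD_NORMAL.cdf((math.log(t) - mu) / sigma)

def lognormal_pdf(t, mu, sigma):
    """Baseline pdf f0(t) = phi((log t - mu) / sigma) / (sigma * t)."""
    return STD_NORMAL.pdf((math.log(t) - mu) / sigma) / (sigma * t)

def baseline_hazard(t, mu, sigma):
    """Baseline hazard h0(t) = f0(t) / (1 - F0(t))."""
    return lognormal_pdf(t, mu, sigma) / (1.0 - lognormal_cdf(t, mu, sigma))

def hazard(t, x, beta, gamma_value, mu, sigma, intercept=0.0):
    """h0(t) * psi(x) * gamma, with psi(x) = exp(intercept + beta . x).
    The intercept is included only to show why the model omits it."""
    psi = math.exp(intercept + sum(b * v for b, v in zip(beta, x)))
    return baseline_hazard(t, mu, sigma) * psi * gamma_value

# Two parameterizations that give the same hazard:
# an intercept of 0.4 with gamma = 1.5, or no intercept with gamma = 1.5 * exp(0.4).
x, beta = [0.2, -1.0], [0.5, 0.3]
h_a = hazard(12.0, x, beta, 1.5, mu=3.0, sigma=1.0, intercept=0.4)
h_b = hazard(12.0, x, beta, 1.5 * math.exp(0.4), mu=3.0, sigma=1.0)
```

Because `h_a` and `h_b` agree for every t and x, a likelihood built from this hazard cannot distinguish the two parameter sets, which is the identifiability problem the text refers to.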

To find the partial derivatives of $l$ with respect to $\mu$, $\sigma$, $\beta$, and $\gamma$, notice that $l$ depends on $\mu$, $\sigma$, and $\gamma$ via the terms $J_i(t)$ ($i = 1, 2, \ldots, N$), and $l$ depends on $\beta$ via the terms $\psi(x_i)$ ($i = 1, 2, \ldots, N$). The relevant partial derivatives of $J_i(t)$ and $\psi(x_i)$ are obtained from

,

From the expressions in Appendix B, the partial derivatives of $F_i(t)$ are obtained from

From Equation (C.1), we also have

for j=0, 1, 2, …, M, and

for q=1, 2, …, Q.

Combining all of these gives the partial derivatives that constitute the gradient of l with respect to the parameters. The gradient can be used in an optimization algorithm for calculating the MLEs of the parameters. Matlab code is available upon request from the authors for the special case considered in this appendix.
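When coding an analytic gradient of this kind, a common safeguard is to compare it with central finite differences before handing it to an optimizer. The sketch below does this for one ingredient only, the derivative of the TDPH cdf with respect to β, using the form F = 1 − exp(−ψ(x)·a) from Appendix B with ψ(x) = exp(β·x). Here `a` stands for the γ-weighted integral of the baseline hazard and is treated as a fixed number; all names are hypothetical, and this is not the authors' Matlab code.

```python
import math

def cdf_from_beta(beta, x, a):
    """F = 1 - exp(-psi(x) * a) with psi(x) = exp(beta . x).
    a is the gamma-weighted baseline integral, held fixed here."""
    psi = math.exp(sum(b * v for b, v in zip(beta, x)))
    return 1.0 - math.exp(-psi * a)

def analytic_grad(beta, x, a):
    """dF/dbeta_q = -(1 - F) * log(1 - F) * x_q,
    which follows from d psi / d beta_q = x_q * psi and psi * a = -log(1 - F)."""
    f = cdf_from_beta(beta, x, a)
    return [-(1.0 - f) * math.log(1.0 - f) * v for v in x]

def finite_difference_grad(func, beta, eps=1e-6):
    """Central finite-difference approximation to the gradient of func at beta."""
    grad = []
    for q in range(len(beta)):
        up = beta[:q] + [beta[q] + eps] + beta[q + 1:]
        dn = beta[:q] + [beta[q] - eps] + beta[q + 1:]
        grad.append((func(up) - func(dn)) / (2 * eps))
    return grad

beta, x, a = [0.3, -0.2], [1.0, 2.0], 0.5
exact = analytic_grad(beta, x, a)
approx = finite_difference_grad(lambda b: cdf_from_beta(b, x, a), beta)
max_gap = max(abs(e - n) for e, n in zip(exact, approx))
```

The same comparison can be applied component by component to the derivatives with respect to μ, σ, and γ once they are implemented.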


About this article

Cite this article

Im, JK., Apley, D., Qi, C. et al. A time-dependent proportional hazards survival model for credit risk analysis. J Oper Res Soc 63, 306–321 (2012). https://doi.org/10.1057/jors.2011.34


  • DOI: https://doi.org/10.1057/jors.2011.34
