A time-dependent proportional hazards survival model for credit risk analysis

Im, J-K; Apley, D W; Qi, C; Shan, X

doi:10.1057/jors.2011.34

General Paper
Published: 11 May 2011

Volume 63, pages 306–321, (2012)

Journal of the Operational Research Society

J-K Im¹, D W Apley¹, C Qi² & X Shan³

Abstract

In the consumer credit industry, assessment of default risk is critically important for the financial health of both the lender and the borrower. Methods for predicting risk for an applicant using credit bureau and application data, typically based on logistic regression or survival analysis, are universally employed by credit card companies. Because of the manner in which the predictive models are fit using large historical sets of existing customer data that extend over many years, default trends, anomalies, and other temporal phenomena that result from dynamic economic conditions are not brought to light. We introduce a modification of the proportional hazards survival model that includes a time-dependency mechanism for capturing temporal phenomena, and we develop a maximum likelihood algorithm for fitting the model. Using a very large, real data set, we demonstrate that incorporating the time dependency can provide more accurate risk scoring, as well as important insight into dynamic market effects that can inform and enhance related decision making.

Acknowledgements

The authors gratefully acknowledge numerous helpful comments from two anonymous referees and the editor.

Author information

Authors and Affiliations

Northwestern University, Evanston, IL, USA
J-K Im & D W Apley
Consultant, Lake Forest, IL, USA
C Qi
Consultant, Deerfield, IL, USA
X Shan

Corresponding author

Correspondence to D W Apley.

Appendices

Appendix A

Some data treatment issues

The data are described in Section 5. Ten predictor variables were chosen for inclusion in the model but, owing to confidentiality concerns, we do not describe them in this paper. We recommend standardizing all variables before running the MLE algorithm, for numerical reasons: the scaling of the variables affects the direction in which a gradient-based algorithm searches for the parameter estimates. Missing data, which are inevitable in large data sets, were handled as follows. Each predictor value that was missing for a customer was replaced with a linear regression prediction of that value, based on a covariance matrix calculated from all data that were not missing.
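The standardization and regression-based imputation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the use of NumPy, the pairwise-available covariance estimate, and the toy two-column array are all assumptions made for the example.

```python
import numpy as np

def standardize(X):
    """Center and scale each column using only its observed (non-NaN) entries."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    return (X - mean) / std, mean, std

def pairwise_cov(Z):
    """Covariance matrix whose (j, k) entry uses every row where both j and k are observed."""
    p = Z.shape[1]
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
            a, b = Z[both, j], Z[both, k]
            S[j, k] = S[k, j] = np.mean((a - a.mean()) * (b - b.mean()))
    return S

def regression_impute(Z):
    """Replace each missing entry by its linear regression prediction
    from the entries observed in the same row (hypothetical helper)."""
    mu = np.nanmean(Z, axis=0)
    S = pairwise_cov(Z)
    Z_imp = Z.copy()
    for i in range(Z.shape[0]):
        miss = np.isnan(Z[i])
        if not miss.any():
            continue
        obs = ~miss
        if not obs.any():
            Z_imp[i] = mu  # nothing observed: fall back to the column means
            continue
        # Regression coefficients S_oo^{-1} S_om (least squares for robustness).
        coef = np.linalg.lstsq(S[np.ix_(obs, obs)], S[np.ix_(obs, miss)], rcond=None)[0]
        Z_imp[i, miss] = mu[miss] + (Z[i, obs] - mu[obs]) @ coef
    return Z_imp

# Toy data (made up): the second predictor is missing for the last customer.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],
              [4.0, 8.0], [5.0, 10.0], [6.0, np.nan]])
Z, mean, std = standardize(X)
Z_imp = regression_impute(Z)
X_imp = Z_imp * std + mean  # back on the original scale
```

A covariance matrix assembled pairwise from available data is not guaranteed to be positive definite, which is why the sketch solves the regression by least squares instead of inverting the matrix directly.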

Appendix B

Derivation of the TDPH cdf

As mentioned in Section 4, the TDPH cdf of $T$ can be written as $F(t; x, \tau; h_0, \psi, \gamma) = 1 - \exp\left(-\int_0^t h_0(u)\,\psi(x)\,\gamma(\tau+u)\,du\right)$. We break the integral into components that correspond to the pieces over which $\gamma$ is constant. Let $w(\tau, t) = \lceil(\tau+t)/3\rceil - \lceil\tau/3\rceil + 1$ denote the number of different $\gamma$ pieces between the time of approval ($\tau$) and the time in question ($\tau+t$). The '3' in the denominator arises from the fact that $t$ is in units of months, and the $\gamma$ function is modelled as constant over each quarter. For a customer joining in month $\tau$, the intervals (in terms of the relative month $t$) over which $\gamma$ is constant can be written as

Here $\lceil\cdot\rceil$ denotes the ceiling function, which returns the smallest integer not less than its argument. In the preceding expressions for the intervals, it is understood that the lower boundary of each is truncated at 0 and the upper boundary at $t$. The constant value of $\gamma(\cdot)$ over the interval $I_{\tau,k}$ is $\gamma_{\lceil\tau/3\rceil+k-1}$. Note that $s_{\tau,0} = 0$ and $s_{\tau,w(\tau,t)} = t$ by definition. The cdf $F(t; x, \tau; h_0, \psi, \gamma)$ of $T$ becomes

where $H_0(t) = \int_0^t h_0(u)\,du = -\log(1 - F_0(t))$ is the cumulative hazard function of the baseline distribution.
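The piecewise evaluation of the cdf can be sketched in a few lines of Python. The sketch assumes that the boundaries are $s_{\tau,k} = \min\{t,\ 3(\lceil\tau/3\rceil + k - 1) - \tau\}$ with $s_{\tau,0} = 0$, which is one reading consistent with the truncation rule and the definition of $w(\tau, t)$ above, so that the integral reduces to a $\gamma$-weighted sum of increments of $H_0$. The lognormal baseline is borrowed from Appendix C, and the function names, the list `gamma` indexed by quarter number, and the numeric values are illustrative assumptions.

```python
import math

def lognormal_cum_hazard(t, mu, sigma):
    """H0(t) = -log(1 - F0(t)) for a lognormal baseline."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    survival = 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z)
    return -math.log(survival)

def breakpoints(tau, t):
    """Return the first quarter index and the boundaries s_0 = 0, ..., s_w = t
    (assumed form: s_k = min(t, 3 * (ceil(tau / 3) + k - 1) - tau))."""
    q0 = math.ceil(tau / 3)
    w = math.ceil((tau + t) / 3) - q0 + 1
    s = [0.0] + [min(t, 3 * (q0 + k - 1) - tau) for k in range(1, w + 1)]
    return q0, s

def tdph_cdf(t, tau, psi_x, gamma, mu, sigma):
    """F = 1 - exp(-psi(x) * sum_k gamma_k * [H0(s_k) - H0(s_{k-1})])."""
    q0, s = breakpoints(tau, t)
    total = 0.0
    for k in range(1, len(s)):
        increment = (lognormal_cum_hazard(s[k], mu, sigma)
                     - lognormal_cum_hazard(s[k - 1], mu, sigma))
        total += gamma[q0 + k - 1] * increment
    return 1.0 - math.exp(-psi_x * total)

# Customer approved in month tau = 2, evaluated t = 5 months later:
# the relative-time pieces are (0, 1], (1, 4], (4, 5], i.e. quarters 1, 2, 3.
example = tdph_cdf(t=5, tau=2, psi_x=1.0, gamma=[1.0, 1.0, 1.3, 0.9], mu=3.0, sigma=1.0)
```

When every $\gamma$ value equals 1 and $\psi(x) = 1$, the sum telescopes to $H_0(t)$ and the function returns the baseline cdf $F_0(t)$, which gives a quick sanity check.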

Appendix C

The log-likelihood and its gradient for lognormal $h_0(t)$ and exponential $\psi(x)$

In this section, we derive the log-likelihood function and its gradient (for use in a gradient-based algorithm for maximizing the likelihood) for the special case of a lognormal $h_0(t)$ and an exponential $\psi(x)$. For other baseline distributions or $\psi(x)$ functions, a similar approach can be used. For our special case, the baseline cdf and pdf are $F_0(t) = \Phi((\log(t)-\mu)/\sigma)$ and $f_0(t) = \phi((\log(t)-\mu)/\sigma)/(t\sigma)$ with mean parameter $\mu$ and standard deviation parameter $\sigma$, where $\Phi$ and $\phi$ denote the standard normal cdf and pdf. In order to avoid identifiability problems due to the confounding between an intercept term $\beta_0$ and $\gamma$, we define the exponential function as $\psi(x) = e^{\beta' x}$, instead of $\psi(x) = e^{\beta_0 + \beta' x}$. For notational simplicity, we denote $F(t; x_i, \tau_i; \mu, \sigma, \beta, \gamma)$ by $F_i(t)$. From Section 4, the log-likelihood function for the entire data set is

where expressions for the $F_i(t)$ are given in Appendix B.
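The identifiability remark made earlier in this appendix can be illustrated with a one-line calculation. Here $\beta_0$ is a hypothetical intercept and $c$ an arbitrary constant, both introduced only for this sketch. The hazard has the form $h_0(t)\,\psi(x)\,\gamma(\tau+t)$, as in the integrand of Appendix B.

```latex
h_0(t)\, e^{\beta_0 + \beta' x}\, \gamma(\tau+t)
  = h_0(t)\, e^{(\beta_0 + c) + \beta' x}\, \bigl[e^{-c}\,\gamma(\tau+t)\bigr]
  \qquad \text{for every constant } c .
```

Shifting the intercept by $c$ while rescaling $\gamma$ by $e^{-c}$ leaves the hazard, and hence the likelihood, unchanged. Fixing the intercept at zero, that is, using $\psi(x) = e^{\beta' x}$, removes this ambiguity and lets the overall level be absorbed into $\gamma$.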

To find the partial derivatives of $l$ with respect to $\mu$, $\sigma$, $\beta$, and $\gamma$, notice that $l$ depends on $\mu$, $\sigma$, and $\gamma$ via the terms $J_i(t)$ ($i = 1, 2, \ldots, N$), and $l$ depends on $\beta$ via the terms $\psi(x_i)$ ($i = 1, 2, \ldots, N$). The relevant partial derivatives of $J_i(t)$ and $\psi(x_i)$ are obtained from

From the expressions in Appendix B, the partial derivatives of $F_i(t)$ are obtained from

From Equation (C.1), we also have

for j=0, 1, 2, …, M, and

for q=1, 2, …, Q.
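As a sketch of how these derivatives fit together, assume that $J_i(t)$ denotes the $\gamma$-weighted sum of baseline cumulative hazard increments from Appendix B, so that $F_i(t) = 1 - \exp(-\psi(x_i) J_i(t))$. The symbol $x_{iq}$ for the $q$th predictor of customer $i$ and the shorthand $z$ are notational assumptions. Under those assumptions the chain rule gives:

```latex
\begin{aligned}
J_i(t) &= \sum_{k=1}^{w(\tau_i,t)} \gamma_{\lceil \tau_i/3\rceil+k-1}\,
          \bigl[H_0(s_{\tau_i,k}) - H_0(s_{\tau_i,k-1})\bigr],
          \qquad F_i(t) = 1 - e^{-\psi(x_i)\,J_i(t)},\\[4pt]
\frac{\partial F_i(t)}{\partial \beta_q}
       &= \bigl(1-F_i(t)\bigr)\,\psi(x_i)\,J_i(t)\,x_{iq},\\[4pt]
\frac{\partial F_i(t)}{\partial \gamma_j}
       &= \bigl(1-F_i(t)\bigr)\,\psi(x_i)\,
          \bigl[H_0(s_{\tau_i,k}) - H_0(s_{\tau_i,k-1})\bigr],
          \qquad k = j-\lceil\tau_i/3\rceil+1,\\[4pt]
\frac{\partial H_0(t)}{\partial \mu}
       &= -\frac{\phi(z)}{\sigma\,\bigl(1-\Phi(z)\bigr)}, \qquad
\frac{\partial H_0(t)}{\partial \sigma}
        = -\frac{z\,\phi(z)}{\sigma\,\bigl(1-\Phi(z)\bigr)}, \qquad
          z = \frac{\log t-\mu}{\sigma}.
\end{aligned}
```

The derivative with respect to $\gamma_j$ is zero whenever $k$ falls outside $1, \ldots, w(\tau_i, t)$, and the derivatives of $F_i(t)$ with respect to $\mu$ and $\sigma$ follow by applying the last line to each $H_0$ term inside $J_i(t)$.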

Combining all of these gives the partial derivatives that constitute the gradient of l with respect to the parameters. The gradient can be used in an optimization algorithm for calculating the MLEs of the parameters. Matlab code is available upon request from the authors for the special case considered in this appendix.
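Before an analytic gradient is handed to an optimizer, it is common practice to check it against finite differences. The sketch below does this for the partial derivatives of $F_i(t)$ with respect to $\beta$ and $\gamma$. It is not the authors' Matlab code: the function names, the parameter values, and the assumed boundaries $s_k = \min\{t,\ 3(\lceil\tau/3\rceil + k - 1) - \tau\}$ are illustrative assumptions.

```python
import math

def H0(t, mu, sigma):
    """Lognormal cumulative hazard, -log(1 - Phi((log t - mu) / sigma))."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return -math.log(0.5 * math.erfc(z / math.sqrt(2.0)))

def hazard_increments(tau, t, mu, sigma):
    """First quarter index and the H0 increments over each gamma piece."""
    q0 = math.ceil(tau / 3)
    w = math.ceil((tau + t) / 3) - q0 + 1
    s = [0.0] + [min(t, 3 * (q0 + k - 1) - tau) for k in range(1, w + 1)]
    return q0, [H0(s[k], mu, sigma) - H0(s[k - 1], mu, sigma) for k in range(1, w + 1)]

def cdf(t, tau, x, beta, gamma, mu, sigma):
    """F = 1 - exp(-psi(x) * J) with psi(x) = exp(beta' x)."""
    q0, dH = hazard_increments(tau, t, mu, sigma)
    psi = math.exp(sum(b * v for b, v in zip(beta, x)))
    J = sum(gamma[q0 + k] * d for k, d in enumerate(dH))
    return 1.0 - math.exp(-psi * J)

def cdf_grad(t, tau, x, beta, gamma, mu, sigma):
    """Analytic partial derivatives of F with respect to beta and gamma."""
    q0, dH = hazard_increments(tau, t, mu, sigma)
    psi = math.exp(sum(b * v for b, v in zip(beta, x)))
    J = sum(gamma[q0 + k] * d for k, d in enumerate(dH))
    surv = math.exp(-psi * J)  # 1 - F
    d_beta = [surv * psi * J * v for v in x]
    d_gamma = [0.0] * len(gamma)
    for k, d in enumerate(dH):
        d_gamma[q0 + k] += surv * psi * d
    return d_beta, d_gamma

def central_diff(f, theta, i, h=1e-6):
    """Central finite-difference derivative of f with respect to theta[i]."""
    up, down = list(theta), list(theta)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2.0 * h)

# Illustrative values only.
t, tau, mu, sigma = 5, 2, 3.0, 1.0
x, beta, gamma = [0.5, -1.0], [0.3, 0.2], [1.0, 1.2, 0.8, 1.1]

d_beta, d_gamma = cdf_grad(t, tau, x, beta, gamma, mu, sigma)
num_beta = [central_diff(lambda b: cdf(t, tau, x, b, gamma, mu, sigma), beta, q)
            for q in range(len(beta))]
num_gamma = [central_diff(lambda g: cdf(t, tau, x, beta, g, mu, sigma), gamma, j)
             for j in range(len(gamma))]
max_err = max(abs(a - n) for a, n in zip(d_beta + d_gamma, num_beta + num_gamma))
```

A small `max_err` suggests the analytic derivatives are coded consistently with the cdf. The same check can be extended to $\mu$ and $\sigma$ and then to the full log-likelihood once its form from Section 4 is implemented.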

About this article

Cite this article

Im, JK., Apley, D., Qi, C. et al. A time-dependent proportional hazards survival model for credit risk analysis. J Oper Res Soc 63, 306–321 (2012). https://doi.org/10.1057/jors.2011.34

Received: 01 March 2010
Accepted: 01 February 2011
Published: 11 May 2011
Issue Date: 01 March 2012
DOI: https://doi.org/10.1057/jors.2011.34
