Abstract
In the consumer credit industry, assessment of default risk is critically important for the financial health of both the lender and the borrower. Methods for predicting risk for an applicant using credit bureau and application data, typically based on logistic regression or survival analysis, are universally employed by credit card companies. Because of the manner in which the predictive models are fit using large historical sets of existing customer data that extend over many years, default trends, anomalies, and other temporal phenomena that result from dynamic economic conditions are not brought to light. We introduce a modification of the proportional hazards survival model that includes a time-dependency mechanism for capturing temporal phenomena, and we develop a maximum likelihood algorithm for fitting the model. Using a very large, real data set, we demonstrate that incorporating the time dependency can provide more accurate risk scoring, as well as important insight into dynamic market effects that can inform and enhance related decision making.
Acknowledgements
The authors gratefully acknowledge numerous helpful comments from two anonymous referees and the editor.
Appendices
Appendix A
Some data treatment issues
The data are described in Section 5. Ten predictor variables were chosen for inclusion in the model, but due to confidentiality concerns, we do not describe them in this paper. For numerical reasons, we recommend standardizing all variables before implementing the MLE algorithm, because the scaling of the variables affects the direction in which a gradient-based algorithm searches for the parameter estimates. Regarding missing data, which are inevitable in large data sets, we used the following strategy: each predictor variable value that was missing for a customer was replaced with a linear regression prediction of that value, based on a covariance matrix calculated from all data that were not missing.
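The data treatment described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pairwise_cov`, `regression_impute`, `standardize`) are hypothetical, missing values are assumed to be coded as NaN, the covariance matrix is assumed to be computed pairwise from whatever entries are available, and imputing before standardizing is an assumed ordering.

```python
import numpy as np

def pairwise_cov(X):
    """Mean vector and covariance matrix using only available (non-NaN) entries."""
    mu = np.nanmean(X, axis=0)
    D = X - mu                                   # NaN wherever data are missing
    ok = ~np.isnan(D)
    D0 = np.where(ok, D, 0.0)                    # missing deviations contribute 0
    counts = ok.T.astype(float) @ ok.astype(float)  # rows where both columns are observed
    cov = (D0.T @ D0) / np.maximum(counts - 1.0, 1.0)
    return mu, cov

def regression_impute(X):
    """Replace each missing value by its linear regression prediction
    from the observed values in the same row (assumed scheme)."""
    mu, cov = pairwise_cov(X)
    X_filled = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        if not obs.any():
            X_filled[i, miss] = mu[miss]         # nothing observed: fall back to the mean
            continue
        # Regression coefficients of the missing variables on the observed ones
        coef = np.linalg.lstsq(cov[np.ix_(obs, obs)],
                               cov[np.ix_(obs, miss)], rcond=None)[0]
        X_filled[i, miss] = mu[miss] + (X[i, obs] - mu[obs]) @ coef
    return X_filled

def standardize(X):
    """Center each column at 0 and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

A typical call would be `Z = standardize(regression_impute(X))`, where `X` is a customers-by-predictors array; `Z` would then be passed to the likelihood maximization.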
Appendix B
Derivation of the TDPH cdf
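As mentioned in Section 4, the TDPH cdf of $T$ can be written as $F(t; x, \tau; h_0, \psi, \gamma) = 1 - \exp\left(-\int_0^t h_0(u)\,\psi(x)\,\gamma(\tau+u)\,du\right)$. We break the integral up into components that correspond to the pieces over which $\gamma$ is constant. Let $w(\tau, t) = \lceil(\tau+t)/3\rceil - \lceil\tau/3\rceil + 1$ denote the number of different $\gamma$ pieces between the time of approval ($\tau$) and the time in question ($\tau+t$). The '3' in the denominator arises from the fact that $t$ is in units of months, and the $\gamma$ function is modelled as constant over each quarter. For a customer joining in month $\tau$, the intervals (in terms of the relative month $t$) over which $\gamma$ is constant can be written as

$$I_{\tau,k} = (s_{\tau,k-1},\, s_{\tau,k}] = \bigl(3(\lceil\tau/3\rceil + k - 2) - \tau,\; 3(\lceil\tau/3\rceil + k - 1) - \tau\bigr], \qquad k = 1, 2, \ldots, w(\tau, t).$$

Here $\lceil\cdot\rceil$ denotes the ceiling function returning the smallest integer not less than the argument. In the preceding expressions for the intervals, it is understood that the lower boundary of each is truncated at 0 and the upper boundary at $t$. The constant value of $\gamma(\cdot)$ over the interval $I_{\tau,k}$ is $\gamma_{\lceil\tau/3\rceil+k-1}$. Note that $s_{\tau,0} = 0$ and $s_{\tau,w(\tau,t)} = t$ by definition. The cdf $F(t; x, \tau; h_0, \psi, \gamma)$ of $T$ becomes

$$F(t; x, \tau; h_0, \psi, \gamma) = 1 - \exp\Bigl(-\psi(x) \sum_{k=1}^{w(\tau,t)} \gamma_{\lceil\tau/3\rceil+k-1}\,\bigl[H_0(s_{\tau,k}) - H_0(s_{\tau,k-1})\bigr]\Bigr),$$

where $H_0(t) = \int_0^t h_0(u)\,du = -\log(1 - F_0(t))$ is the cumulative hazard function of the baseline distribution.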
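The piecewise evaluation of the cdf can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the breakpoint formula is inferred from the description of $\gamma$ as constant over calendar quarters (quarter $j$ is assumed to cover calendar months $3(j-1)+1$ through $3j$), `gamma` is assumed to be a sequence or mapping indexed by quarter number, and `H0` is any callable returning the baseline cumulative hazard.

```python
import math

def n_pieces(tau, t):
    """w(tau, t): number of quarterly gamma pieces between months tau and tau + t."""
    return math.ceil((tau + t) / 3) - math.ceil(tau / 3) + 1

def breakpoints(tau, t):
    """Breakpoints s_0 = 0 <= s_1 <= ... <= s_w = t in relative months.

    Assumed reconstruction: the k-th quarter seen by the customer ends at
    relative month 3 * (ceil(tau / 3) + k - 1) - tau, truncated to [0, t].
    """
    q0 = math.ceil(tau / 3)
    s = [0.0]
    for k in range(1, n_pieces(tau, t) + 1):
        s.append(min(t, max(0.0, 3 * (q0 + k - 1) - tau)))
    return s

def tdph_cdf(t, tau, psi_x, gamma, H0):
    """F(t) = 1 - exp(-psi(x) * sum_k gamma_k * [H0(s_k) - H0(s_{k-1})])."""
    q0 = math.ceil(tau / 3)
    s = breakpoints(tau, t)
    total = 0.0
    for k in range(1, len(s)):
        total += gamma[q0 + k - 1] * (H0(s[k]) - H0(s[k - 1]))
    return 1.0 - math.exp(-psi_x * total)
```

For example, a customer approved in month `tau = 2` and observed for `t = 5` months spans three quarters, with breakpoints at relative months 0, 1, 4 and 5. When every `gamma` value equals 1, the result reduces to the ordinary proportional hazards cdf `1 - exp(-psi_x * H0(t))`.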
Appendix C
The log-likelihood and its gradient for lognormal $h_0(t)$ and exponential $\psi(x)$
In this section, we derive the log-likelihood function and its gradient (for use in a gradient-based algorithm for maximizing the likelihood) for the special case of a lognormal $h_0(t)$ and an exponential $\psi(x)$. For other baseline distributions or $\psi(x)$ functions, a similar approach can be used. For our special case, the baseline cdf and pdf are $F_0(t) = \Phi((\log(t) - \mu)/\sigma)$ and $f_0(t) = \phi((\log(t) - \mu)/\sigma)/(t\sigma)$ with mean parameter $\mu$ and standard deviation parameter $\sigma$, where $\Phi$ and $\phi$ denote the standard normal cdf and pdf. In order to avoid identifiability problems due to the confounding between an intercept term in $\psi$ and $\gamma$, we define the exponential function as $\psi(x) = e^{\beta' x}$, with no intercept term. For notational simplicity, we denote $F(t; x_i, \tau_i; \mu, \sigma, \beta, \gamma)$ by $F_i(t)$. From Section 4, the log-likelihood function for the entire data set is
where expressions for the $F_i(t)$ are given in Appendix B.
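The building blocks of this special case can be sketched in Python using only the standard library. The function names are hypothetical; the formulas follow the definitions of $F_0$, $f_0$ and $\psi$ given above, together with $H_0(t) = -\log(1 - F_0(t))$ from Appendix B. The standard normal cdf is computed from `math.erfc`, which is an implementation choice made here, not something specified in the paper.

```python
import math

SQRT2 = math.sqrt(2.0)

def lognormal_cdf(t, mu, sigma):
    """Baseline cdf F0(t) = Phi((log t - mu) / sigma)."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return 0.5 * math.erfc(-z / SQRT2)

def lognormal_pdf(t, mu, sigma):
    """Baseline pdf f0(t) = phi((log t - mu) / sigma) / (t * sigma)."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * t * sigma)

def lognormal_cum_hazard(t, mu, sigma):
    """Baseline cumulative hazard H0(t) = -log(1 - F0(t))."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    # 1 - Phi(z) computed directly as 0.5 * erfc(z / sqrt(2)) to limit cancellation
    return -math.log(0.5 * math.erfc(z / SQRT2))

def psi(x, beta):
    """Exponential covariate function psi(x) = exp(beta' x), with no intercept."""
    return math.exp(sum(b * xi for b, xi in zip(beta, x)))
```

Computing the survival probability $1 - \Phi(z)$ directly through `erfc`, instead of subtracting the cdf from 1, is intended to keep the cumulative hazard usable at large $t$, where $F_0(t)$ is close to 1.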
To find the partial derivatives of $l$ with respect to $\mu$, $\sigma$, $\beta$, and $\gamma$, notice that $l$ depends on $\mu$, $\sigma$, and $\gamma$ via the terms $J_i(t)$ ($i = 1, 2, \ldots, N$), and $l$ depends on $\beta$ via the terms $\psi(x_i)$ ($i = 1, 2, \ldots, N$). The relevant partial derivatives of $J_i(t)$ and $\psi(x_i)$ are obtained from
From the expressions in Appendix B, the partial derivatives of $F_i(t)$ are obtained from
From Equation (C.1), we also have
for j=0, 1, 2, …, M, and
for q=1, 2, …, Q.
Combining all of these gives the partial derivatives that constitute the gradient of l with respect to the parameters. The gradient can be used in an optimization algorithm for calculating the MLEs of the parameters. Matlab code is available upon request from the authors for the special case considered in this appendix.
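Before passing an analytic gradient to an optimizer, it is common to check it against finite differences. The Python sketch below does this for one building block only, the lognormal cumulative hazard $H_0(t) = -\log(1 - \Phi(z))$ with $z = (\log t - \mu)/\sigma$. The derivative formulas in it are obtained here by the chain rule and are not taken from the paper's display equations; the function names and the step size `h` are assumptions.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_sf(z):
    """Survival function 1 - Phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def cum_hazard(t, mu, sigma):
    """H0(t) = -log(1 - Phi(z)), z = (log t - mu) / sigma."""
    z = (math.log(t) - mu) / sigma
    return -math.log(std_normal_sf(z))

def cum_hazard_grad(t, mu, sigma):
    """Analytic (dH0/dmu, dH0/dsigma) from the chain rule:
    dH0/dz = phi(z) / (1 - Phi(z)), dz/dmu = -1/sigma, dz/dsigma = -z/sigma."""
    z = (math.log(t) - mu) / sigma
    ratio = std_normal_pdf(z) / std_normal_sf(z)
    return (-ratio / sigma, -z * ratio / sigma)

def finite_diff_grad(f, t, mu, sigma, h=1e-6):
    """Central finite-difference approximation to (df/dmu, df/dsigma)."""
    d_mu = (f(t, mu + h, sigma) - f(t, mu - h, sigma)) / (2.0 * h)
    d_sigma = (f(t, mu, sigma + h) - f(t, mu, sigma - h)) / (2.0 * h)
    return (d_mu, d_sigma)
```

Comparing `cum_hazard_grad(t, mu, sigma)` with `finite_diff_grad(cum_hazard, t, mu, sigma)` at a few parameter values gives a quick sanity check; the same pattern extends to the full log-likelihood once its gradient has been coded.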
Im, JK., Apley, D., Qi, C. et al. A time-dependent proportional hazards survival model for credit risk analysis. J Oper Res Soc 63, 306–321 (2012). https://doi.org/10.1057/jors.2011.34
DOI: https://doi.org/10.1057/jors.2011.34