Time to default in credit scoring using survival analysis: a benchmark study

Abstract We investigate the performance of various survival analysis techniques applied to ten actual credit data sets from Belgian and UK financial institutions. In the comparison we consider classical survival analysis techniques, namely the accelerated failure time models and Cox proportional hazards regression models, as well as Cox proportional hazards regression models with splines in the hazard function. Mixture cure models for single and multiple events were more recently introduced in the credit risk context. The performance of these models is evaluated using both a statistical evaluation and an economic approach through the use of annuity theory. It is found that spline-based methods and the single event mixture cure model perform well in the credit risk context.


Introduction
With the introduction of compliance guidelines such as Basel II and Basel III, and the resulting need for more accurate credit risk calculations, survival analysis has gained importance in recent years. Historically, survival analysis has mainly been used in medicine and engineering, where the time until an event is analyzed, for example the time until death or machine failure (see Kalbfleisch and Prentice, 2002; Collett, 2003; Cox and Oakes, 1984).
As an alternative to logistic regression, Narain (1992) first introduced the idea of using survival analysis in the credit risk context. The advantage of using survival analysis in this context is that the time to default can be modeled, and not just whether an applicant will default or not. Many authors followed the example of Narain (1992) and started to use more advanced methods than the parametric accelerated failure time (AFT) survival methods used in this first work. An overview is given in Table 1. With its flexible nonparametric baseline hazard, the Cox proportional hazards (Cox PH) model was an obvious first alternative to the AFT model (Banasik et al, 1999), and subsequent contributions extended both Cox PH and AFT models by using, among others, coarse classification (Stepanova and Thomas, 2002) and time-varying covariates (Bellotti and Crook, 2009). In recent research some authors have experimented with mixture cure models. These models make it possible to include a ''cured'' fraction, a part of the population that will not go into default, in survival models.
In the existing literature, some questions remain. Firstly, except for Zhang and Thomas (2012), there has been no attempt to compare a wide range of the available methods in one paper. Secondly, in each of the papers listed in Table 1, only one data set was analyzed, which does not allow conclusions to be drawn on which of the survival methods to use. Finally, in the majority of the papers, the evaluation remains largely focused on classification and the area under the receiver operating characteristics curve (AUC). In this paper, we contribute to the existing literature by analyzing ten different data sets from five banks, using all classes of models listed in Table 1, and using both statistical (AUC and default time predictions) and economic evaluation measures (by predicting the future value of the loan). These measures are applicable to all model types considered, the ''plain'' survival models as well as the mixture cure models.
Other interesting modeling approaches exist, but are not included in this comparison. These include discrete time hazard models such as used in Crook (2013, 2014) and Leow and Crook (2016). For further research it would be interesting to compare continuous time with discrete time models.
The remainder of this paper is organized as follows. Section 2 gives an overview of the survival analysis techniques used. In Section 3 the data and the experimental setup are discussed in more detail. The evaluation measures are covered in Section 4, followed by the results and discussion in Sections 5 and 6.

Survival analysis methods
In survival analysis, one is interested in the timing, $T$, of a certain event. The survival function can be expressed as the probability of not having experienced the event of interest by some stated time $t$, hence $S(t) = P(T > t)$. In the context of credit risk, the event of interest is default (together with early repayment and maturity for the mixture cure model with multiple events, see Section 2.5). Given the survival function, the probability density function $f(u)$ is given by
$$f(u) = -\frac{d}{du} S(u).$$
Additionally, the hazard function
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T > t)}{\Delta t}$$
models the instantaneous risk. This function can also be expressed in terms of the survival function and the probability density function,
$$h(t) = \frac{f(t)}{S(t)}.$$
In survival analysis, a certain proportion of the cases is censored, which means that for these cases, the event of interest has not yet been observed at the moment of data gathering. In this paper, we use two different definitions for censoring.
1. In the first definition, censored cases are the loans that, at the moment of data gathering, had not reached their predefined end date (loans that did reach it are called ''mature'' cases) and had experienced neither default nor early repayment.
2. According to the second definition, a censored case corresponds to a loan that did not experience default by the moment of data gathering. Early repayment and mature cases are marked censored. This kind of censoring is used in models where default is the only event of interest.
One can interpret these two types of censoring as follows. The first definition defines censoring as we observe it when obtaining the data: some loan applicants defaulted, some repaid early and some loans matured (that is, they were fully repaid at the end of the loan term). The remaining loans, where none of these events have (yet) been observed, are censored. According to the second definition, however, only two possible states are considered (instead of four): default and censoring. Here, all the cases that are labeled mature or early repayment according to the first definition get the label ''censored.'' Hence, when applying survival analysis to model the time to default, the second definition is used (models in Sections 2.1-2.4). Only for the multiple event mixture cure models in Section 2.5, where competing event types are taken into account, is the first definition used. The censoring indicator for the $i$th case is denoted by $\delta_i$, which is equal to 1 for an uncensored observation and 0 for a censored one.
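The two censoring definitions can be illustrated with a minimal Python sketch. The single-letter state labels ("D" for default, "E" for early repayment, "M" for maturity, "C" for none of these observed yet) and the function name are illustrative assumptions, not part of the data sets described here.

```python
def censoring_indicator(state, definition):
    """Return the censoring indicator: 1 if uncensored, 0 if censored.

    state: one of "D", "E", "M", "C" (illustrative labels).
    definition: 1 or 2, following the two definitions in the text.
    """
    if definition == 1:
        # First definition: default, early repayment and maturity are all
        # observed events; only loans with none of these are censored.
        return 0 if state == "C" else 1
    if definition == 2:
        # Second definition: default is the only event of interest, so
        # early repayment and mature cases are also marked as censored.
        return 1 if state == "D" else 0
    raise ValueError("definition must be 1 or 2")


states = ["D", "E", "M", "C"]
first_definition = [censoring_indicator(s, 1) for s in states]
second_definition = [censoring_indicator(s, 2) for s in states]
print(first_definition)   # [1, 1, 1, 0]
print(second_definition)  # [1, 0, 0, 0]
```

Under the second definition only the defaults remain uncensored, which is why censoring rates are so much higher for the models that treat default as the only event.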
When using survival models as regression models, a covariate vector and corresponding parameter vector are present. In all models in Section 2, the covariate vector is denoted by $x$, and the parameter vector by $\beta$.
Accelerated failure time models
In accelerated failure time (AFT) models, the covariates act multiplicatively on the time scale of a baseline survival function $S_0$,
$$S(t \mid x) = S_0\left(\exp(-\beta' x)\, t\right),$$
where the event rate is slowed down when $0 < \exp(-\beta' x) < 1$ and sped up when $\exp(-\beta' x) > 1$. The hazard function is given by
$$h(t \mid x) = \exp(-\beta' x)\, h_0\left(\exp(-\beta' x)\, t\right).$$
In the general form, the accelerated failure time model can be expressed as a log-linear model for the timing of the event of interest,
$$\log(T) = \beta' x + \sigma \varepsilon,$$
with $\varepsilon$ a random error following some distribution and $\sigma$ an additional parameter that rescales $\varepsilon$. As many classical survival distributions such as the Weibull distribution, exponential distribution and log-logistic distribution have event times that are log-linear, AFT models are often used as a starting point in order to parametrize these distributions. The three models mentioned above are used in the benchmark study and covered in Sections 2.1.1-2.1.3. For a full overview on AFT models and more technical details we refer to Collett (2003) and Kleinbaum and Klein (2011). AFT models are used in the credit risk context by Narain (1992) (who used an exponential distribution), Banasik et al (1999) (who used exponential and Weibull distributions) and Zhang and Thomas (2012) (who used Weibull, log-logistic and gamma distributions).
2.1.1. Weibull AFT model
The Weibull model in its classical form can be expressed by the following survival and hazard function with scale $\lambda$ and shape $p$:
$$S(t) = \exp(-\lambda t^{p}), \qquad h(t) = \lambda p t^{p-1}.$$
For the $i$th case, $\lambda_i = \exp(-p\, \beta' x_i)$, with $p = 1/\sigma$, is the reparametrization used to incorporate the explanatory variables.

Exponential AFT model
The exponential distribution is a special case of the Weibull distribution, with $p = 1$. This leads to a survival and hazard function
$$S(t) = \exp(-\lambda t), \qquad h(t) = \lambda.$$
In the exponential distribution the strong assumption of a constant hazard rate $\lambda$ is made, and for each case $\lambda_i = \exp(-\beta' x_i)$. Note that for the exponential model $\varepsilon$ is not rescaled ($\sigma = 1$).

Log-logistic AFT model
The log-logistic distribution is parametrized by $\theta$ and $\kappa$. Using the AFT reparametrization with the relationship $\sigma = 1/\kappa$, the log-logistically distributed event time $T_i$ has a survival function
$$S(t \mid x_i) = \frac{1}{1 + \left(\exp(-\beta' x_i)\, t\right)^{1/\sigma}}.$$
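As a rough illustration of the three AFT models, the following Python sketch evaluates their survival functions starting from the log-linear form $\log(T) = \beta' x + \sigma \varepsilon$. It assumes a standard extreme-value error for the Weibull model and a standard logistic error for the log-logistic model, which are textbook results rather than formulas quoted from this paper. The function names and the example coefficients are hypothetical.

```python
import math


def linear_predictor(beta, x):
    """Compute beta'x for coefficient and covariate lists of equal length."""
    return sum(b * v for b, v in zip(beta, x))


def weibull_aft_survival(t, beta, x, sigma):
    """S(t | x) = exp(-lambda_i * t**p), p = 1/sigma, lambda_i = exp(-p * beta'x)."""
    p = 1.0 / sigma
    lam = math.exp(-p * linear_predictor(beta, x))
    return math.exp(-lam * t ** p)


def exponential_aft_survival(t, beta, x):
    """Exponential model: the Weibull model with p = 1 (sigma = 1)."""
    return weibull_aft_survival(t, beta, x, sigma=1.0)


def loglogistic_aft_survival(t, beta, x, sigma):
    """S(t | x) = 1 / (1 + (exp(-beta'x) * t)**(1/sigma))."""
    scaled_time = math.exp(-linear_predictor(beta, x)) * t
    return 1.0 / (1.0 + scaled_time ** (1.0 / sigma))


# Hypothetical coefficients: an intercept and one covariate.
beta = [3.0, 0.5]
x = [1.0, 2.0]
for month in (12, 24, 36):
    print(month,
          weibull_aft_survival(month, beta, x, sigma=0.5),
          exponential_aft_survival(month, beta, x),
          loglogistic_aft_survival(month, beta, x, sigma=0.5))
```

In this parametrization a larger linear predictor $\beta' x$ stretches the time scale, so survival at any fixed month increases with $\beta' x$ for all three models.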

Cox proportional hazards model
Another method which is commonly used in survival analysis is the Cox proportional hazards model (Cox, 1972). This method is more flexible than any AFT model as it contains a nonparametric baseline hazard function, $h_0(t)$, along with a parametric part. In this model, the hazard function is given by
$$h(t \mid x) = h_0(t) \exp(\beta' x), \tag{1}$$
and the survival function is
$$S(t \mid x) = \exp\left(-H_0(t) \exp(\beta' x)\right),$$
with $H_0(t)$ the cumulative baseline hazard function. In this paper, Breslow's method is used to estimate the cumulative baseline hazard rate, given by
$$\hat{H}_0(t) = \sum_{i:\, t_i \le t} \frac{\delta_i}{\sum_{j \in R(t_i)} \exp(\hat{\beta}' x_j)},$$
where $R(t_i)$ denotes the group of individuals at risk at time $t_i$ (which are, in the credit risk context, the ones that have not yet defaulted by time $t_i$). For more information on the Breslow and other estimators for the Cox PH model, we refer to Klein and Moeschberger (2003). The Cox PH model was first used in the credit context by Banasik et al (1999).
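A minimal Python sketch of a Breslow-type estimate of the cumulative baseline hazard is given below. It assumes that the linear predictors $\hat{\beta}' x_i$ have already been fitted elsewhere, and it takes the risk set at an event time to be all cases whose observed time is at least that event time. The function and argument names are hypothetical; in practice this computation is handled by survival software.

```python
import math


def breslow_cumulative_hazard(times, events, linear_predictors):
    """Breslow-type estimate of the cumulative baseline hazard.

    times: observed event or censoring times.
    events: censoring indicators (1 = default observed, 0 = censored).
    linear_predictors: fitted values of beta'x per case (assumed given).
    Returns a list of (event_time, cumulative_hazard) pairs.
    """
    weights = [math.exp(lp) for lp in linear_predictors]
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cumulative = 0.0
    estimate = []
    for t in event_times:
        # Number of defaults observed exactly at time t.
        n_events = sum(1 for u, d in zip(times, events) if u == t and d == 1)
        # Risk set: cases still under observation at time t.
        at_risk = sum(w for u, w in zip(times, weights) if u >= t)
        cumulative += n_events / at_risk
        estimate.append((t, cumulative))
    return estimate


# With all linear predictors equal to zero the estimate reduces to the
# Nelson-Aalen estimator: 1/3 at time 1, then 1/3 + 1/1 at time 3.
print(breslow_cumulative_hazard([1, 2, 3], [1, 0, 1], [0.0, 0.0, 0.0]))
```

An estimated survival curve for a case with covariates $x$ then follows as $\exp(-\hat{H}_0(t) \exp(\hat{\beta}' x))$.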

Cox proportional hazards model with splines
The hazard function in the Cox PH model, see formula (1), assumes a proportional hazards structure with a log-linear model for the covariates. As a result, for any continuous variable, e.g., age, the default hazard ratio between a 25- and a 30-year-old is the same as the hazard ratio between a 70- and a 75-year-old. As it is likely that this assumption does not hold, other functional forms of covariates have been sought; for an overview, see Therneau and Grambsch (2000). One of the most popular methods to deal with this is by using splines. Splines are flexible functions defined by piecewise polynomials that are joined in points called ''knots.'' Some constraints are imposed to ensure that the overall curve is smooth. Any continuous variable can be represented by a spline. Hence, where in formula (1) the linear predictor is denoted by
$$\beta' x = \beta_1 x_1 + \cdots + \beta_m x_m,$$
splines can be introduced by modeling some or all, say $m - l$, of the continuous covariates by a spline approximation,
$$\beta_1 x_1 + \cdots + \beta_l x_l + f_{l+1}(x_{l+1}) + \cdots + f_m(x_m).$$
For an example, see Figure 1. To get a smooth function, a basis of functions with continuous first derivatives is often used to construct a spline function. A popular spline basis is the basis of cubic spline functions
$$1,\; x,\; x^2,\; x^3,\; (x - \kappa_1)_+^3,\; \ldots,\; (x - \kappa_q)_+^3$$
with $q$ knots $\kappa_1, \ldots, \kappa_q$. A spline model is formed by taking a linear combination of the spline basis functions. The disadvantage of power bases, however, lies in the fact that they can become numerically unstable when a large number of knots are included. For this reason, an equivalent basis with more stable numerical properties, the B-spline basis (de Boor, 2001), is nowadays widely used. Both spline models in this study use a cubic B-spline basis in the Cox PH model. For an overview on splines in a general framework, we refer to Ruppert et al (2003).
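The truncated power basis mentioned above can be sketched in a few lines of Python. This only illustrates the power basis; the models in this study use the numerically more stable B-spline basis instead. Function names, knots and coefficients are hypothetical.

```python
def cubic_truncated_power_basis(x, knots):
    """Evaluate 1, x, x^2, x^3, (x - k_1)^3_+, ..., (x - k_q)^3_+ at x."""
    basis = [1.0, x, x ** 2, x ** 3]
    # (x - k)_+ is zero to the left of the knot and x - k to the right.
    basis += [max(x - k, 0.0) ** 3 for k in knots]
    return basis


def spline_value(x, knots, coefficients):
    """Linear combination of the basis functions: one value of f(x)."""
    basis = cubic_truncated_power_basis(x, knots)
    return sum(c * b for c, b in zip(coefficients, basis))


# Two hypothetical knots: the second truncated term is still zero at x = 2.
print(cubic_truncated_power_basis(2.0, [1.0, 3.0]))
```

Each knot adds one basis function, so a spline with $q$ knots has $q + 4$ coefficients in this basis.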
2.3.1. Natural splines
A commonly used modification of the cubic B-spline basis is the natural cubic spline basis. Natural cubic splines satisfy the additional constraint that they are linear in their tails beyond the boundary knots, which are taken to be the endpoints of the data.

Penalized splines
As the number of knots in a spline becomes relatively large, a fitted spline function will show more variation than justified by the data. To limit overfitting, O'Sullivan (1986) introduced a smoothness penalty by integrating the square of the second derivative of the fitted spline function. Later, Eilers et al (1996) showed that this penalty could also be based on higher-order finite differences of adjacent B-splines. Penalized splines or ''P-splines'' use the latter method to estimate spline functions.
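The finite-difference penalty used by P-splines can be sketched as follows. The sketch assumes a penalty of the form: smoothing parameter times the sum of squared differences of a given order between adjacent B-spline coefficients. The function names and the default order of 2 are illustrative choices, not settings taken from this study.

```python
def finite_differences(coefficients, order):
    """Apply the difference operator `order` times to adjacent coefficients."""
    values = list(coefficients)
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def difference_penalty(coefficients, order=2, smoothing=1.0):
    """Smoothing parameter times the sum of squared finite differences."""
    diffs = finite_differences(coefficients, order)
    return smoothing * sum(d * d for d in diffs)


# Coefficients on a straight line are not penalized by a second-order
# difference penalty, while a wiggly sequence is.
print(difference_penalty([1.0, 2.0, 3.0, 4.0]))
print(difference_penalty([0.0, 1.0, 0.0]))
```

A larger smoothing parameter pulls the fitted coefficients toward a sequence with zero differences of the chosen order, which limits the extra variation introduced by a large number of knots.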

Mixture cure model
In the medical context, mixture cure models were motivated by the existence of a subgroup of long-term survivors, or a ''cured'' fraction (see Sy and Taylor, 2000;Peng and Dear, 2000). In contrast to non-mixture survival models, where the event of interest is assumed to take place in the long run, these types of models are typically used in contexts where a certain proportion of the population will not experience the event.
Looking at the credit data, it becomes indeed apparent that a very large proportion of the population will not go into default. ''Cure'' (or ''non-susceptibility'') can here be interpreted as a situation where an individual is not expected to experience default. The mixture cure model is then in fact a mixture distribution where, on one hand, a logistic regression model provides a mixing proportion of the ''non-susceptible'' cases.
On the other hand, a survival model describes the ''survival behavior'' of the cases susceptible to the event of interest. This type of model is of particular interest in credit risk modeling as the event of interest here, default, will not occur for a very high proportion of the cases. This idea was introduced in the credit risk context for the first time by Tong et al (2012). In Dirick et al (2015), a model selection criterion adapted to these models was introduced and applied to credit risk data. For the mixture cure model, the unconditional survival function is given by
$$S(t \mid x, z) = \pi(z)\, S(t \mid Y = 1, x) + 1 - \pi(z), \tag{2}$$
where $Y$ is the susceptibility indicator ($Y = 1$ if an account is susceptible, and $Y = 0$ if not) and $\pi(z) = P(Y = 1 \mid z)$. Note that a new covariate vector $z$ is introduced, which is the covariate vector of the logistic regression model, in this case the binomial logit, with corresponding parameter vector $b$. In this paper, the conditional survival function modeling the cases that are susceptible is given by a Cox proportional hazards model,
$$S(t \mid Y = 1, x) = \exp\left(-H_0(t \mid Y = 1) \exp(\beta' x)\right).$$
As in the Cox proportional hazards model in a non-mixture context, the Breslow-type estimator is used for estimation of the cumulative baseline hazard. Figure 2 shows the difference between the survival curves for plain survival functions (such as non-mixture Cox PH and AFT functions) compared to the unconditional survival functions of the mixture cure model. Whereas plain survival curves go to zero as the time goes to infinity, the unconditional survival curves for the mixture cure model ''plateau'' at a positive value $1 - \pi(z)$.

Figure 1 Functional form for one of the covariates x, describing the relationship between x and spline approximation f(x) using penalized splines. x is a variable in one of the ten data sets (more details are not disclosed due to confidentiality reasons). The pointwise 95% confidence bands are given by the dotted lines.
The mixture cure model is computationally more intensive than plain survival models, as the use of an iterative procedure, the expectation maximization (EM)-algorithm, is needed in order to overcome incomplete information on Y. For more information on mixture cure models, we refer to Farewell (1982), Tong et al (2012) and Dirick et al (2015).
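The plateau of the unconditional survival curve can be illustrated with a short Python sketch. It assumes a binomial logit for the probability of being susceptible and takes the conditional survival probability of the susceptible cases as a given number. The function names, the coefficient values and the covariate values are hypothetical.

```python
import math


def susceptibility_probability(logit_coefficients, covariates):
    """Binomial logit: probability that an account is susceptible (Y = 1)."""
    eta = sum(c * v for c, v in zip(logit_coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-eta))


def unconditional_survival(susceptible_probability, conditional_survival):
    """Mixture: susceptible cases follow the conditional survival curve,
    non-susceptible cases never default."""
    return (susceptible_probability * conditional_survival
            + (1.0 - susceptible_probability))


# Hypothetical account: intercept and one covariate in the logit part.
pi = susceptibility_probability([-3.0, 0.4], [1.0, 2.0])
for conditional in (1.0, 0.5, 0.0):
    print(conditional, unconditional_survival(pi, conditional))
```

When the conditional survival probability reaches zero, the unconditional survival probability equals one minus the susceptibility probability, which is the plateau described above.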

Mixture cure model with multiple events
In the medical context it is unusual to ever truly observe cure. In cancer research, for example, a subject might pass away from the specific cancer under research immediately after the observation period, despite having a high probability of being cured. Observed cure does exist in the credit risk context, since once a loan reaches maturity, it is known that default cannot occur anymore. As the censoring indicator in the mixture cure model only provides information on whether default took place or not, information on maturity is not used in the model. Another shortcoming is the fact that it does not account for an important ''competing risk,'' early repayment, where a borrower repays the loan before the predetermined end date.
Watkins et al (2014) recently proposed a method that provides simultaneous modeling of multiple events, along with a mature group. Dirick et al (2015) extended this model by allowing for the semi-parametric Cox proportional hazards model to model the survival times, instead of the parametric survival models proposed by the former authors. Applied to the credit risk example, three indicators are introduced:
1. $Y_m$, indicating that the loan is considered to be mature, so repaid at the indicated end date of the loan;
2. $Y_d$, indicating that default takes place;
3. $Y_e$, indicating that early repayment takes place.
Note that an important limitation here is that $Y_m$ is only defined for fixed term loans. As a result, the multiple event mixture cure model in this form is not usable for applications on revolving credit data sets. For the fixed end term data sets used in this paper, the set $(Y_m, Y_d, Y_e)$ is exhaustive and mutually exclusive. However, when an observation is censored (according to the first definition in Section 2), it is not known which event type will occur. In analogy to Equation (2), the unconditional survival function can be written as
$$S(t \mid x, z) = \pi_e(z)\, S_e(t \mid Y_e = 1, x) + \pi_d(z)\, S_d(t \mid Y_d = 1, x) + 1 - \pi_e(z) - \pi_d(z),$$
with $S_e(t \mid Y_e = 1, x)$ and $S_d(t \mid Y_d = 1, x)$ denoting the conditional survival functions for, respectively, early repayment and default, which are modeled using a Cox proportional hazards model, as in Equation (2). The $\pi_j(z)$'s with $j \in \{e, d\}$, where $z$ is the covariate vector of the logit part of the model with parameter vectors $b_e$ and $b_d$, are modeled using a multinomial logit model, hence
$$\pi_d(z) = \frac{\exp(b_d' z)}{1 + \exp(b_e' z) + \exp(b_d' z)}, \tag{3}$$
and $\pi_e(z)$ is found analogously.

3. The data and experimental setup

Data preprocessing and missing inputs
We received data sets from five financial institutions in the UK and Belgium, consisting of mainly personal loans and loans of small enterprises, with varying loan terms (for details, see Table 2). Note that this resulted in data sets with either only personal loans or, for banks C and D, a mix of personal and small enterprise loans. As the SMEs (small and medium-sized enterprises) in our data sets were all sole proprietorships, their properties were nearly identical to those of personal loans. More information on the use of survival models for SMEs in the broader sense can be found in Fantazzini and Figini (2009), Gupta et al (2015) and Holmes et al (2010), among others. For the banks with data covering several loan terms, the data were split in order to get only one loan term per data set, resulting in ten data sets. Table 3 lists these data sets, which were used to evaluate the different survival techniques listed in Section 2. Except for bank C, where default is defined as missing two consecutive payments, all banks defined default as missing three consecutive months of payments.
As survival analysis techniques are unable to cope with missing data, and with several data sets having a considerable amount of missing inputs, some preprocessing mechanism to cope with missing data is needed. We want to stress that there are many ways of doing this. As this benchmark paper aims to focus on different models, however, rather than data preprocessing (which is typical for benchmarking studies; see Baesens et al, 2003; Dejaeger et al, 2012; Loterman et al, 2012, among others), we chose to employ the rule of thumb also used in the benchmarking paper by Dejaeger et al (2012). Therefore, for continuous inputs, median imputation is used when at most 25% of the values are missing, and the inputs are removed if more than 25% are missing. For categorical inputs, a missing value category is created if more than 15% of the values are missing; otherwise the observations associated with the missing values are removed from the data set.
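The rule of thumb for missing inputs can be sketched in Python as follows. The sketch assumes that missing values are encoded as None and that each input is handled as a plain list. The function names and the "missing" category label are hypothetical.

```python
from statistics import median


def handle_continuous(values, threshold=0.25):
    """Median-impute a continuous input, or return None if it should be removed."""
    n_missing = sum(v is None for v in values)
    if n_missing / len(values) > threshold:
        return None  # more than 25% missing: drop the input
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def handle_categorical(values, threshold=0.15):
    """Return (column, rows_to_drop) for a categorical input."""
    missing_rows = [i for i, v in enumerate(values) if v is None]
    if len(missing_rows) / len(values) > threshold:
        # More than 15% missing: keep all rows, add a missing-value category.
        return ["missing" if v is None else v for v in values], []
    # Otherwise the affected observations are removed from the data set.
    return list(values), missing_rows


print(handle_continuous([1.0, None, 3.0, 5.0]))
print(handle_continuous([None, None, 1.0, 2.0]))
print(handle_categorical(["a", None, None, "b"]))
```

Note that the two branches act on different units: the continuous rule removes a whole input, while the categorical rule removes individual observations.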
The number of input variables in the resulting data sets varies from 6 to 31, and the number of observations from 7521 to 80,641. For each observation, an indicator for default, early repayment and maturity is included, taking the value of 1 for the respective event of interest that took place, and 0 for the others (note that only one event type can occur for each observation). Percentages of occurrences of these three event types per data set are given in Table 3. For censored observations according to the first censoring definition, all indicators are 0. According to the second censoring definition, only defaults are considered uncensored. In terms of our data sets, this means that censoring rates range from around 20 to 85% according to the first definition (used for the multiple event mixture cure model), whereas censoring percentages range from 94.56 to 98.16% according to the second definition.
Additionally, a time variable is included for each observation, representing the respective month of the event, which takes an integer value. Note that the time variable for a mature event is always equal to the length of the loan term (e.g., a matured loan for data set 5 has value 24), and the time variable for a censored event is given by the last observed month in which a repayment was observed to take place.

Experimental setup
Each data set was randomly split up in a training set and a test set consisting of 2/3 and 1/3 of the observations, respectively. The models are estimated on the training sets, and the corresponding test sets are used for evaluation.
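A random 2/3 versus 1/3 split can be sketched as below. The seed and the function name are illustrative assumptions; the text does not state how the randomization was carried out.

```python
import random


def split_train_test(observations, train_fraction=2 / 3, seed=0):
    """Randomly split observations into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    indices = list(range(len(observations)))
    rng.shuffle(indices)
    cut = round(train_fraction * len(indices))
    train = [observations[i] for i in indices[:cut]]
    test = [observations[i] for i in indices[cut:]]
    return train, test


train, test = split_train_test(list(range(9)))
print(len(train), len(test))  # 6 3
```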
For all models, the software R is used. AFT and Cox proportional hazards modeling is possible through the use of the R-package survival (Therneau, 2015), with additional use of the functions ns and pspline for inclusion of natural splines and penalized splines in the covariates, respectively. An ad hoc method was used to decide for which of the continuous variables a spline function should be introduced. Using the pspline function on each continuous variable in the model separately, the resulting spline curves were inspected to detect possible nonlinear relationships, with knots determined by the adapted AIC method (Eilers et al, 1996, included in the package). The resulting Cox proportional hazards pspline models consisted of all the P-splines where nonlinear relationships were observed. As the ns function does not have a built-in way to optimize the number of knots, the same continuous variables and number of knots were chosen as in the pspline models. For some of the data sets, the number of splines or knots using the natural splines was altered in comparison with the pspline models, in order to get a feasible fit.
For the mixture cure model, the R-package smcure by Cai et al (2012a, b) is used. An extended code based on this package, as in Dirick et al (2015), is used for the multiple event mixture cure model (code available upon request).

Performance criteria/evaluation metrics
Three main performance criteria were used.

AUC in ROC curves
In the credit risk context, a ubiquitous method to evaluate binary classifiers is by means of the receiver operating characteristics curve. This curve illustrates the performance of a binary classifier for each possible threshold value, by plotting the true positive rate against the false positive rate. The specific measure of interest is the area under the curve (AUC), which can also be computed in the context of survival analysis. In this context, evaluation is possible at any timepoint of the survival curve (see Heagerty and Saha, 2000). For each data set and each model, the AUC for the test sets at 1/3 and 2/3 of the time to maturity and at the maturity time itself (which is equal to the loan term) is listed in Table 4.
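The idea of an AUC at a fixed timepoint can be sketched in Python under a strong simplification: cases are loans that defaulted by the chosen horizon, controls are loans still under observation after the horizon, and loans censored before the horizon are simply left out. This is not the estimator of Heagerty and Saha (2000), which treats censoring more carefully. The function name and the example data are hypothetical.

```python
def auc_at_time(times, events, risk_scores, horizon):
    """Simplified AUC at one timepoint.

    times: observed event or censoring times.
    events: 1 if default was observed, 0 otherwise.
    risk_scores: higher means higher predicted default risk by the horizon,
                 for example 1 - S(horizon | x).
    """
    cases = [r for t, d, r in zip(times, events, risk_scores)
             if d == 1 and t <= horizon]
    controls = [r for t, r in zip(times, risk_scores) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5  # ties count as half
    return concordant / (len(cases) * len(controls))


# Hypothetical test set with a 24-month term, evaluated at month 12.
times = [5, 8, 20, 24, 24]
events = [1, 1, 0, 0, 0]
risk = [0.9, 0.4, 0.5, 0.2, 0.1]
print(auc_at_time(times, events, risk, horizon=12))
```

Evaluating the same scores at 1/3, 2/3 and the full loan term only changes which loans count as cases and controls.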
Despite the fact that AUC and other classification-based evaluation methods are most common in the literature (see Table 1), this way of evaluating survival analysis in credit risk does not fully highlight the benefits of using survival analysis in this context. First of all, the AUC does not fully summarize the time aspects of survival analysis (the AUC is calculated at one specific timepoint), and secondly, the financial aspect is neglected. The next two sections focus on the timing aspects and economic/financial evaluation, respectively.

Evaluation through default time prediction
When evaluating through default time prediction, we look at how well we are able to predict the default times of the defaults in the test set. A survival curve does not give one time estimate, but a distribution of time estimates. With a high amount of censoring, the mean values of these distributions are not good predictors. Zhang and Thomas (2012) compute a predictor for the recovery rate in survival analysis by looking at each percentile of the training set and calculating the squared and absolute deviations of the predictions from the observed values of the default cases. Next, the percentiles resulting in the lowest deviations are retained and used to compute the deviations in the test set.
We use the same method as Zhang and Thomas (2012), but consider the default time instead of recovery values, and look at each permille instead of each percentile. For each data set, the permilles that result in the smallest deviations for the training sets are retained and used to compute default time predictions in the test sets. The results are listed in Table 5, where the MSE columns list the mean of the squared differences between the predicted and observed default times, and the MAE columns list the mean of the absolute differences between the predicted and observed default times.
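One possible reading of this procedure is sketched below in Python. It assumes that the predicted default time for a given permille level is the first month at which the estimated survival curve falls to or below that level, and that the level is selected by the smallest mean absolute error over the training defaults. Both choices, as well as the function names, are assumptions made for illustration and may differ from the implementation used in the study.

```python
def predicted_time(survival_curve, level):
    """First month at which the survival estimate is at or below `level`.

    survival_curve: list of (month, survival probability), non-increasing.
    Falls back to the last month if the level is never reached.
    """
    for month, probability in survival_curve:
        if probability <= level:
            return month
    return survival_curve[-1][0]


def best_permille(curves, observed_times):
    """Select the permille level with the smallest MAE on training defaults."""
    best_level, best_mae = None, float("inf")
    for k in range(1, 1000):
        level = k / 1000
        errors = [abs(predicted_time(c, level) - t)
                  for c, t in zip(curves, observed_times)]
        mae = sum(errors) / len(errors)
        if mae < best_mae:
            best_level, best_mae = level, mae
    return best_level, best_mae


# Hypothetical survival curve over a 24-month term.
curve = [(m, 1 - m / 32) for m in range(1, 25)]
print(predicted_time(curve, 0.75))          # 8
print(best_permille([curve, curve], [8, 8]))
```

The selected level would then be applied unchanged to the survival curves of the defaulted loans in the test set to obtain the MSE and MAE.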
Note that the part of the data set which is evaluated is considerably smaller here compared to the AUC method. A schematic representation is given in Figure 3. Each letter represents an observation in the entire data set, with four possible end states: early repayment (''E''), default (''D''), maturity (''M'') and censored (''C''). The green circle encompasses the test set elements, which are all evaluated when computing the AUC. The default time prediction method, however, only evaluates the default times of the ''actual'' defaults in the test set, which are in the red circle. As the evaluation set differs from one method to another, the sample size for each of the resulting test sets is included in the result tables (Tables 4, 5, 6 and 7).

Evaluation using annuity theory
When banks grant a loan to a customer, they are particularly interested in the expected future value at the end of the loan term. One can use the principles of annuity theory (for an overview, see Kellison and Irwin 1991) to compute this value, though these basic principles do not incorporate risk; hence, these formulas start from the assumption that loans will be repaid with a 100% certainty. Including this risk aspect is exactly what can be done using survival analysis, as it provides us with an accurate estimate for the probability that a customer is still repaying his loan at every time instant of the survival curve.
In this study, we computed the true future value of the uncensored test set loans (given by the observations in the blue circle in Figure 3), taking into account their true end-state (default, early repayment or maturity), and compared them to their estimated values using each of the survival models. In order to make the results comparable, some assumptions are made and applied when evaluating the models for all data sets:
(d) $i$ the monthly interest rate ($i = (1 + i_y)^{1/12} - 1$);
(e) $(E)FV$ the (expected) future value of a loan.

Table 4 Test set ''areas under the curve'' (AUC) for the different methods applied to the ten data sets when evaluating at several timepoints, corresponding to 1/3, 2/3 and the full loan term, which depends on the data set.
A bank can reinvest the repayment sums $R_s$ as soon as they are paid by the client. Assume that the same interest rate applies. If there is no risk of default or early repayment, the future value is given by Equation (4). For the uncensored test set loans, we wish to estimate the future loan values. In Section 4.3.1, we describe how we compute the true future values when knowing the eventual state (''D,'' ''M'' or ''E''), and in Sections 4.3.2-4.3.3, the expected future loan value is estimated when using the model predictions. Table 6 lists the mean absolute differences between the observed future values and the expected future loan values using the model estimations. In Table 7, we consider the mean expected future values per loan and compare them with the mean observed future value.
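The reinvestment idea can be illustrated with a small Python sketch. It assumes that repayments are made at the end of each month and are reinvested at the same monthly rate until the end of the loan term, which is one common annuity convention and may differ in timing details from the formulas used in this paper. The case of early repayment is left out because it requires the outstanding debt. The function names and the example loan are hypothetical.

```python
def monthly_rate(yearly_rate):
    """Monthly interest rate i = (1 + i_y)**(1/12) - 1."""
    return (1 + yearly_rate) ** (1 / 12) - 1


def future_value_paid(repayment, rate, n_periods, n_paid):
    """Value at the end of the term of the first `n_paid` monthly repayments.

    Assumption: the repayment of month j is made at the end of month j and
    reinvested for the remaining n_periods - j months.
    """
    return sum(repayment * (1 + rate) ** (n_periods - j)
               for j in range(1, n_paid + 1))


# Hypothetical 24-month loan with a monthly repayment of 100.
i = monthly_rate(0.05)
matured = future_value_paid(100.0, i, n_periods=24, n_paid=24)
# Default after 10 repayments, with nothing recovered afterwards.
defaulted = future_value_paid(100.0, i, n_periods=24, n_paid=10)
print(matured, defaulted)
```

Under these assumptions a matured loan corresponds to all repayments being made, while a default after a given number of months keeps only the repayments received before the default.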

The true future loan values
The true future loan value depends on the eventual loan outcome or state. For mature loans, Equation (4) can be used with $n$ the total number of periods, that is, the loan term. For a loan with early repayment, the future value depends on $L_k$, the resulting amount of the debt in time period $k$. When an early repayment takes place in period $k$, we assume that the loan is repaid as usual until this period $k$ and that the sum $L_k$ is fully repaid in this period. This sum can still be reinvested for $n - k - 1$ periods. Note that early repayment always yields a smaller revenue compared to a matured loan. For a loan where default takes place after $k$ months, we assume that nothing of the remaining sum $L_k$ is recovered.

4.3.3. The expected future loan values using mixture cure models
When computing the expected future loan values by means of the mixture cure models, we need to take the results of the binomial (single event) and multinomial logit (multiple event) part of the model into account. For the mixture cure model in Section 2.4, we have probabilities of being susceptible to default or not ($PD$ and $1 - PD$) for every subject, defined in (5). These enter the expected future value in (6), where $\hat{S}(t)^{d}_{s,m} = \hat{S}(t \mid Y = 1)_{s,m}$ denotes the conditional aspect of the survival estimates in the mixture cure context as in (2).

Figure 3 Schematic representation of the data set. Each letter represents an observation in the data set. The data set elements that are in the test set are in the largest (green) circle. All test set elements are evaluated using the AUC evaluation method. The uncensored test set elements (according to the first definition of censoring, see Section 2) that are in the middle (blue) circle are evaluated through the economic evaluation method using annuity theory. Default time prediction evaluation can only be performed on the defaulted elements of the test set, encompassed by the smallest (red) circle (Color figure online).
For the multiple event mixture cure model in Section 2.5, the multinomial logit (expression 3) leads in a similar way as (5) to probabilities of early repayment $PE$ and probabilities of maturity $PM$ (which is, in fact, $1 - PD - PE$). Here, $\hat{S}(t)^{e}_{s,m} = \hat{S}(t \mid Y_e = 1)_{s,m}$ and $\hat{S}(t)^{d}_{s,m} = \hat{S}(t \mid Y_d = 1)_{s,m}$ are again conditional probabilities (given $Y_e$ and $Y_d$) that subject $s$ has not repaid early or defaulted, respectively, by time $t$. The expected future value is given by (7). The first two lines of (7) are completely identical to (6), where $(1 - PD)$ is replaced by $PM$ (or, in other words, $1 - PD - PE$, as early repayment is also considered here). The second part of the expression is dominated by the event of early repayment. Early repayment works in a similar way as default, in the sense that repayment of the fixed sum $R_s$ occurs each month with a probability $\hat{S}(j)^{e}_{s,m}$, which explains the first term in the second line of (7). The main difference with default, however, is that when early repayment occurs at timepoint $j$ (this happens with a probability $\hat{S}(j - 1)^{e}_{s,m} - \hat{S}(j)^{e}_{s,m}$), the bank receives $L_{s,j}$, the resulting amount of the outstanding debt at timepoint $j$. This idea is displayed in the last term of (7).
Note that this expression assumes that the penalty term for early repayment is equal to zero, whereas in reality a fixed fee usually needs to be paid [see Ma et al (2010), where the fee is 2 months of interest on the outstanding debt]. The reason for this assumption is twofold. First, with data from different sources and no information on the extent of early repayment fees, taking a fee of zero seems the fairest decision. Second, although including a fixed fee will increase both the observed and expected future value, it does not seem that the fee will affect the relative performance of the methods.

4.3.4. Evaluating the expected future value with respect to the observed future value
For each of the uncensored test set cases, the observed future value can be computed given the eventual outcome and be compared with the expected future values using the models. Table 6 lists the mean of the absolute differences between the expected and the observed values per case. In Table 7, the mean expected future values of all uncensored test set loans are listed and can be compared with the mean of the true future loan value at the bottom of the table.
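The two summaries used in this comparison, the mean absolute difference per case and the mean future value per loan, can be sketched as follows. The function names and the numbers in the usage note are made up for illustration only.

```python
def mean_absolute_difference(expected, observed):
    """Mean of |expected - observed| over all cases (the quantity listed per model)."""
    if len(expected) != len(observed) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - o) for e, o in zip(expected, observed)) / len(expected)


def mean_value(values):
    """Mean future value per loan."""
    return sum(values) / len(values)
```

The two summaries can disagree: with expected values `[100, 300]` and observed values `[300, 100]`, the means coincide while the mean absolute difference is 200, because errors of opposite sign cancel in the mean but not in the absolute differences.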

Results
The results in Tables 4, 5, 6 and 7 are grouped per evaluation measure. For Tables 5 and 6, we used a notational convention where the best test result (each time the smallest value) per data set is underlined and denoted in boldface. Performances that are significantly different at a 5% level from the top performance with respect to a one-sided Mann-Whitney test are denoted in boldface (a Bonferroni correction was used due to multiple testing). As the AUC values in Table 4 are point estimates and do not represent samples, here simply the three highest values are underlined for each evaluation time and data set. In Table 7 the three values that lie closest to the mean future value per loan are underlined. Table 8 summarizes the results of all preceding tables by giving the average ranks of the models for all evaluation methods.
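The significance marking can be sketched with a one-sided Mann-Whitney test and a Bonferroni-corrected threshold. This sketch assumes a normal approximation without tie correction in the variance and assumes that the per-case errors of each model are compared with those of the best model; the text does not state which test variant or software was used, and the function names are hypothetical.

```python
import math


def mann_whitney_greater(x, y):
    """Approximate one-sided p-value for H1: values in x tend to exceed values in y.

    Uses the normal approximation to the U statistic (no tie correction).
    """
    n1, n2 = len(x), len(y)
    # U counts pairs where x beats y; ties count one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def significantly_worse(errors_by_model, best, alpha=0.05):
    """Flag models whose errors are significantly larger than those of `best`.

    Bonferroni correction: alpha is divided by the number of comparisons.
    """
    others = [m for m in errors_by_model if m != best]
    threshold = alpha / len(others)
    return {
        m: mann_whitney_greater(errors_by_model[m], errors_by_model[best]) < threshold
        for m in others
    }
```

For small samples an exact test would be preferable to the normal approximation used here.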
In Table 4 we note that the sample size matters for obtaining better receiver operating characteristic curves, as AUC values are generally larger for data sets with more observations. Another factor that seems important is the length of the loan duration. Comparing the AUC results of data set 3 with data set 2, and data set 8 with data set 7, AUC values seem to go down when moving from a shorter loan term to a longer one, even though the data come from the same bank and the sample size is larger. This might be expected, as it is known that making predictions becomes harder when moving to longer time frames. Looking at the overall result in Table 4, however, it is hard to draw conclusions regarding the preferred survival method when looking at the AUC alone, as the values are very close to each other (we note that ties in Table 4 are due to rounding). This can also be seen in Table 8, as average rankings regarding AUC range from 2.8 to 5.6. A Cox PH method with penalized splines proves to be the preferred method each time. A log-logistic AFT model seems to be a good alternative when considering the average ranking, although it appears only 10 out of 30 times among the top three in Table 4.

Next we consider mean squared differences (MSE) and mean absolute differences (MAE) from the observed default time (see Table 5). Although many performance measures are not significantly different from the top performance at the 5% level, a general trend for these evaluation measures is that the non-AFT models clearly outperform the AFT models. An interesting observation occurs when looking at the respective sample sizes of the evaluated sets. As depicted in Figure 3, only the actual defaults are evaluated in Table 5. While most data sets here are quite small (166 cases or fewer), four sets are still considerable in size: those for data sets 1, 2, 3 and 8. For these data sets, more models can be excluded as their results (for MSE) are significantly worse (in bold).
Additionally, we observe for these data sets that the Cox PH model is very dominant, being the best model in seven out of eight cases (considering both MSE and MAE for all four data sets). Considering all ten data sets, the exponential AFT model in particular seems to have default time predictions that are significantly far off the true default times. With average rankings having a bigger range compared to AUC (from 1.8 to 8), it seems that the default time prediction measure clearly favors the plain Cox PH model when the sample size is considerable. When fewer cases were evaluated, the Cox PH with natural splines and the mixture cure model seem to be good alternatives.

Table 6 lists the mean of the absolute differences between the model expected future loan value estimates and the true values. Note that these differences are bigger for loans with a longer loan term, which makes sense, as here the loan amounts are larger too. Consulting Table 6 along with Table 8, it becomes clear that the Cox PH model with penalized splines is again outperforming the other methods (although not significantly), followed by the Weibull AFT and the plain Cox PH model. The table lists two clearly inferior methods, which are the exponential AFT model and the multiple event mixture cure model.
Regarding the financial metrics in Table 6, the mean absolute differences can become substantial (e.g., in data set 3), but considering Table 7 we note that the mean expected values per loan are close to the mean observed value of the loans for all methods. The results in this table clearly highlight the abilities of survival analysis in the credit risk context. It should be noted that all estimates are very close to the observed mean future value per loan. While Table 8 highlights the exponential AFT model with an average ranking of 2.2, the mixture cure model performs better than all other methods in five out of ten data sets (data sets 5-8 and 10), whereas the exponential AFT model is ranked best in three out of ten. Additionally, for Table 7, the mixture cure model tends to perform better on smaller sample sizes, whereas the exponential AFT model performs better on bigger sample sizes.
Drawing a general conclusion from Table 8, the shortcomings of AUC again become apparent. Having become a major metric in the financial world to evaluate classification models, this metric is currently often applied to survival analysis models, but there are several issues. Firstly, it does not seem to be able to discriminate one survival model from another, given the small ranges of average rankings, and secondly, it evaluates the models by looking at predictions on individual case levels, in contrast with the default time or expected loan value predictions. Considering these latter evaluation methods, Cox PH models with and without splines and single event mixture cure models seem to be consistently good performers. The main advantage of the mixture cure model lies in the fact that one basic assumption of survival models, namely that a survival curve should go toward zero when time goes to infinity, which is often violated for loan data, is not needed at all when using the mixture cure model. All nonmixture cure models (here wrongly) assume this condition on the survival curve to hold. The multiple event mixture cure model does not seem to live up to the expectations. It is important to note, though, that for a fair evaluation, one would have had to consider the results of other methods when using these to predict early repayment as well, not only default. Modeling default and early repayment in one model, as the multiple event mixture cure model does, as opposed to using two different survival models, will likely lead to a better overall result, but additional research needs to be done to verify this.
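Average ranks of the kind summarized in Table 8 can be computed as in the following sketch. It assumes that tied scores share the average of the ranks they span; the tie handling used in the study is not stated. The function names, the model labels and the scores in the usage note are illustrative only.

```python
def ranks(values, higher_is_better=False):
    """Rank values within one data set (1 = best); ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # average of the 1-based positions
        for i in order[pos:end + 1]:
            result[i] = shared
        pos = end + 1
    return result


def average_ranks(scores, higher_is_better=False):
    """scores maps each model to a list with one score per data set."""
    models = list(scores)
    n_sets = len(scores[models[0]])
    totals = {m: 0.0 for m in models}
    for d in range(n_sets):
        per_set = ranks([scores[m][d] for m in models], higher_is_better)
        for m, r in zip(models, per_set):
            totals[m] += r
    return {m: totals[m] / n_sets for m in models}
```

For an error measure such as MAE, lower scores rank first, for example `average_ranks({"cox": [1.0, 2.0], "aft": [2.0, 2.0], "cure": [3.0, 1.0]})`; for AUC one would pass `higher_is_better=True`.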

Discussion
In this paper, we studied the performance of several survival analysis techniques in the credit scoring context. Ten real-life data sets were used, and we used three main evaluation measures to assess model performance: AUC, default time prediction differences and future loan value estimation. It is shown that Cox PH-based models all work particularly well, especially a Cox PH model in combination with penalized splines for the continuous covariates. The Cox PH model usually outperforms the multiple event mixture cure model, but the mixture cure model does not perform significantly differently in most of the cases, and is among the top models using economic evaluation. This model has the advantage of not requiring the survival function to go to zero when time goes to infinity, which is often the more appropriate assumption for credit scoring data.
Starting from these findings, it would be interesting to further extend the mixture cure model and study the performance of the resulting model in comparison with a Cox PH model with penalized splines. This could be done by allowing for splines in the continuous covariates or time-dependent covariates for these models. Additionally, it would be interesting to run all the models again over data that have been coarse-classified, and compare the results with those in this study. In particular, it would be interesting to compare the results of coarse classification to the spline-based methods in this study, which can be seen as an alternative for handling nonlinearity in the data. This study also points out that finding an appropriate evaluation measure to compare survival analysis methods remains an interesting challenge, as the AUC does not seem to have the right properties to really distinguish one method from another.