Heterogeneities among credit risk parameter distributions: the modality defines the best estimation method

Comparative studies investigating the estimation accuracy of statistical methods often arrive at different conclusions. Therefore, it remains unclear which method is best suited for a particular estimation task. While this problem exists in many areas of predictive analytics, it has particular relevance in the banking sector owing to regulatory requirements regarding transparency and quality of estimation methods. For the estimation of the relevant credit risk parameter loss given default (LGD), we find that the different results can be attributed to the modality type of the respective LGD distribution. Specifically, we use cluster analysis to identify heterogeneities among the LGD distributions of loan portfolios of 16 European countries with 32,851 defaulted loans. The analysis leads to three clusters, whose distributions essentially differ in their modality type. For each modality type, we empirically determine the accuracy of 20 estimation methods, including traditional regression and advanced machine learning. We show that the specific modality type is crucial for the best method. The results are not limited to the banking sector, because the present distribution type-dependent recommendation for method selection, which is based on cluster analysis, can also be applied to parameter estimation problems in all areas of predictive analytics.


Introduction
Companies collect and generate large amounts of data, which are analyzed with methods of predictive analytics and used for forecasting and estimation purposes to draw conclusions for business decisions. In particular in the context of such estimation problems, companies and data scientists have a large number of methods at their disposal, ranging from traditional linear regression to advanced methods of machine learning. Due to the large number of methods, the question immediately arises as to which method should be used for a specific estimation problem. Several studies have already dealt with this question, but they come to different conclusions.
For example, in a comprehensive study King et al. (1995) apply 16 methods, including traditional and advanced methods, to 12 real-world estimation problems in the fields of image processing, medicine, engineering, and finance. They find that there is no consistently superior method and that the respective estimation performance depend critically on the datasets used. More recently, Baumann et al. (2019) perform a detailed comparison of 14 machine learning methods, which they analyzed with regard to the estimation accuracy using 20 datasets from different fields such as life sciences, physical sciences, engineering, social sciences, economics and others. Again, it becomes apparent that the best method in each case changes depending on the dataset.
The optimal choice of estimation method is particularly relevant in the banking sector. One reason for this is a regulatory requirement as per which banks must provide their own estimates of risk parameters when using the internal ratingsbased approach. In addition to the regulatory requirement, accurate predictions of risk parameters are relevant for the risk-adjusted pricing of loans. Estimation methods with high predictive accuracy offer banks a competitive advantage, whereas weak predictions can lead to adverse selection. In this study, we focus on the loss given default (LGD), which is one of three relevant parameters to estimate the risk associated with a credit product. Also with regard to LGD, several studies exist in the literature examining different methods in terms of their estimation accuracy. Because these studies show considerable differences in results, it also remains unclear for LGD estimation, which method has the highest estimation accuracy. For example, Hurlin et al. (2018) base their analysis on defaulted customers in Brazil, finding that the random forest mostly outperforms other advanced methods and that the regression tree shows low estimation accuracy. Kaposty et al. (2020) arrive at a comparable conclusion. Using a dataset of defaulted corporate leases in Germany, they find that more sophisticated methods, especially the random forest, lead to remarkable increases in the prediction accuracy. In contrast, Yao et al. (2017) examine data on U.K. bank credit cards and conclude that a combination of least squares support vector regression and ordinary least squares regression leads to the best out-of-sample estimation accuracy compared to 11 alternative (and combined) methods. Similarly, Loterman et al. (2012) compare 24 methods and find that support vector regression and an artificial neural network perform significantly better than other methods do, including 1 3 Heterogeneities among credit risk parameter distributions:… the regression tree. However, in their examination of U.S. revolving credit loans, Tobback et al. (2014) find that the forecasting accuracy of support vector regression is lower than that of a regression tree. Consistent with this finding, Bastos (2010), Qi andZhao (2011), andHartmann-Wendels et al. (2014) recommend using a regression tree to predict the recovery rates of Portuguese, U.S., and German loans, respectively. By using recovery rates of U.S. non-financial corporations, Altman and Kalotay (2014) show that mixture regressions provide more accurate out-of-sample estimates than other regression-based methods. The high predictive accuracy of mixture regressions is confirmed by a study of Min et al. (2020), who conduct a method comparison (including regression tree and neural network) based on recovery rates of small-and medium-sized entities in the U.S. More recently, Bellotti et al. (2021) compare different methods for the prediction of European non-performing loans and find that rule-based algorithms such as cubist regression model, boosted trees, and random forest perform significantly better than other approaches. Consequently, most studies show that advanced methods have higher estimation accuracies than traditional methods, such as the ordinary least squares regression. However, based on a dataset of U.K. defaulted credit cards, Bellotti and Crook (2012) find that the ordinary least squares regression is superior to advanced methods. Similarly, Sopitpongstorn et al. (2021) find that advanced methods such as artificial neural network and regression tree show lower predictive performance than the (local) logit regression in the prediction of loan recovery rates.
In summary, views on how well various LGD estimation methods perform are mixed. The different results can be attributed, in particular, to the different countries in which the loan portfolios are located (see, e.g., Bastos (2010)). More specifically, Grunert and Weber (2009) note that most studies focus on the U.S. banking sector but there may be national differences in bankruptcy law or the characteristics of borrowing companies. In addition, Grippa et al. (2005) and Querci (2005) observe differences in LGD characteristics across geographic regions in their investigation of Italian accounts. The results indicate that the LGD distributions in credit portfolios seem to differ between regions and countries.
It remains unclear which characteristics of an LGD distribution are responsible for the different performance results in the literature. Against this background, the present study aims to identify the distributional features relevant to the quality of LGD estimation methods, and subsequently, determine the methods that have the highest estimation accuracy for the relevant distribution types. In this way regulators, for example, obtain simple rules for LGD estimation in the banking environment without having to rely on the specific loan portfolio of a bank. This is relevant insofar as the recently introduced constraints on the use of internal ratings-based approach demonstrate that the regulators are basically striving to simplify and standardize the estimation of credit risks (see Basel Committee on Banking Supervision (2016).
To this end, we must achieve the following objectives: consider a broad international loan portfolio; identify heterogeneities among LGD distributions, that is, the relevant characteristics and types of distributions; and compare the estimation methods individually for each LGD distribution type. To meet the first objective, we 1 3 Heterogeneities among credit risk parameter distributions:… The remainder of this article is structured as follows. Section 2 introduces the data, including the descriptive statistics, describes the estimation methods used in the comparative analysis, presents the cluster analysis, and identifies the three relevant LGD distribution types. Section 3 contains the comparative analysis for each of the three distribution types and the resulting recommendations for LGD estimation. In Section 4 several robustness checks are performed. Section 5 concludes the paper.

Data and LGD estimation approach
To conduct the study, we use the database of Global Credit Data 2 , which contains detailed information on loan defaults of 55 banks. In this section, we first introduce the data used and provide descriptive statistics. Afterwards, we explain the theoretical background of the LGD estimation methods used in the comparative analysis. Finally, we describe the cluster analysis and identify heterogeneities among LGD distributions.

Data description
The analysis is based on a dataset of resolved defaulted loans by SMEs from 16 European countries. The estimation of the LGD is based on workout recovery rates, which are calculated as the difference between all discounted post-default incoming cash flows ( F + ) and all discounted post-default costs ( C − ), divided by the EAD. That is, Incoming cash flows comprise principal, interest, and post-resolution payments, the recorded book value of collateral, received fees, and commissions. Costs include legal expenses, administrator and receiver fees, liquidation expenses, and other external workout costs. All cash flows are discounted using the three-month EURI-BOR of the respective default date.
In the following, we briefly describe the restrictions we apply to the raw dataset, which comprises 38,166 defaulted loans. The filter rules are based on Gürtler and Hibbeln (2013), European Banking Authority (2016), Krüger and Rösch (2017), and Betz et al. (2018).
First, we restrict the sample to all defaults since 2000 and do not include defaults after 2016. The lower time limit is selected to ensure the consistent default definition of Basel II and thus prevent biased estimation results. The upper time limit is selected for two reasons. In the subsample of recently defaulted loans (with completed workout processes), short workout periods are obviously overrepresented. Because, in turn, loans with shorter workout periods tend to be associated with lower LGDs, this subsample can lead to a sample selection bias. In addition, workout processes of recent defaults are not necessarily completed. By limiting the time span of the dataset, we remove 2,177 observations. Second, cures are not considered, because they do not provide default data with actual losses (see Krüger and Rösch (2017)). By excluding cures, we drop 1,147 observations. Third, in the Global Credit Data database, the default losses range from zero (e.g., for uncalled contingent facilities) to several hundred million euros. To satisfy a materiality threshold, we remove loans with an EAD of less than €500. Using this threshold, we exclude 978 observations. Fourth, to correct input errors and to ensure consistent and plausible data, we eliminate loans with an abnormally low or high LGD, i.e., smaller than −100% and higher than 200%, respectively. We drop 265 observations. Finally, loans with incomplete observations are excluded. We remove 748 observations. Overall, a dataset of 32,851 loans remains. Table 1 presents the descriptive statistics; in particular, it shows the means and quantiles of the LGDs of selected loan categories. We separate the loans by the availability of guarantees, collateral type, facility type, seniority type, and industry type. The table is a further indicator of the plausibility of the dataset. For example, the existence of guarantees or securities reduces LGDs. Conversely, non-senior and short-term loans lead to higher LGDs. Interestingly, loans from the "finance, insurance, real estate" sector have the lowest LGDs.
In addition to the loan-specific properties, we consider macroeconomics variables to improve the prediction of the LGD, as suggested in the literature. 3 Therefore, we use various macroeconomic control variables. For the overall real and financial environment in Europe, we use the return of the STOXX 600 Index. Because Qi and Zhao (2011) and Chava et al. (2011) identify the three-month treasury bill as a significant variable to consider expectations of future financial conditions in the U.S., we use the six-month EURIBOR as a significant driver for LGDs in Europe. Following Mora (2015) and Yao et al. (2015), we also use the GDP to measure the market value of all final goods and services produced in the considered period in Europe. Specifically, we consider a dummy variable that indicates whether the GDP has increased from the previous quarter. We also tested other popular macroeconomic variables, such as the 10-year euro area yield, unemployment rate, and economic sentiment index. However, they were excluded owing to their strong correlations with other macroeconomic factors and thus, lower explanatory power.

LGD estimation methods
The challenge in LGD estimation lies in providing estimated LGDs that are close to and highly correlated with the true LGDs. As such, there is a wide range of estimation methods used in the literature. For a comprehensive analysis, all methods used in the literature must be included in the comparison. In selecting the procedures, we followed Bellotti et al. (2021) who examined 18 LGD estimation procedures. By considering two further procedures, we finally use 20 different methods, categorized as either traditional or advanced methods, as noted in Section 1. In the following, we briefly introduce the competing methods (summarized in Table 2) as well as the main references in each case.
We use linear regression as the first traditional method because it is usually used as a reference method in other LGD studies. For instance, the linear regression has been implemented in a comparative context by Loterman et al. (2012) and Krüger and Rösch (2017). From a statistical perspective, linear regression has restrictions that may make it less suitable for LGD estimation. For this reason, we include other traditional methods that address these restrictions.
First, linear regression requires exogenously identifying the best subset of the variables to include in the model. The wrong choice of variables can induce problems such as biased regression coefficients (if relevant variables are omitted) or a decrease in estimation precision (if irrelevant variables are included in the model). To overcome the difficulties in variable selection, we use ordinary least squares regression with backward elimination and least angle regression. We use backward elimination instead of simple forward selection, which has the disadvantage of neglecting variable interactions (see, e.g., Smith (2018)). The purpose of both applied methods is to build a multiple regression model that includes a parsimonious set of variables, without compromising the estimating ability of the model. The use of variable selection methods in LGD estimation is proposed, for instance, by Hartmann-Wendels et al. (2014) and Ye and Bellotti (2019).
Second, linear regression models that contain multiple variables are susceptible to overfitting and may reveal a high variance of the model estimators, which typically results in a high expected mean squared error (hereafter referred to as "estimation error"). To reduce the parameter variance, we apply penalized regressions (i.e., ridge regression, lasso regression and elastic net regression), which introduce constraints that limit the model parameters. 4 In the LGD estimation, penalized regressions are implemented by Loterman et al. (2012), among others.
Third, the values predicted by linear regression can theoretically range from minus infinity to infinity. Because LGDs are restricted by a lower limit (close to zero) and an upper limit (close to one), linear regression will not meet this restriction. To consider the LGD boundaries, we use fractional logit regression. Because we use data with plausible values out of [0, 1] 5 , we transform the observed LGD using the equation below (as proposed by Krüger and Rösch (2017)) before performing the fractional logit regression:

3
After the estimation, we re-transform the predicted values to the previous LGD scale. Fractional logit regressions are used for LGD estimation by Dermine and de Carvalho (2006) and Chava et al. (2011), among others.
In addition to the traditional methods, we use various advanced methods. These are assumed to have an improved estimation accuracy because they do not require a specific functional form or distribution assumptions (see Nazemi et al. (2017), LGD 0,1 = LGD − min(LGD) max(LGD) − min(LGD)  Miller and Töws (2018), and Yao et al. (2015)). At the same time, advanced methods are more prone to overfitting than the traditional methods, which may, in turn, lead to inferior estimation accuracy (see Qi and Zhao (2011)). Therefore, hyperparameter tuning and a good understanding of the methods' functioning are required when using the advanced methods. A description of the methods used in the comparative analysis is given below. As the first advanced methods, we use various rule-based methods because they allow nonparametric representations of the relationships between the dependent and explanatory variables. The most basic method in this context is the regression tree, popularized by Breiman (1984). The method recursively splits the data into groups and uses the group averages of the dependent variable as its mean prediction. This approach has been applied to LGD estimation by, for example, Matuszyk et al. (2010) and Hurlin et al. (2018). However, regression trees often suffer from variable selection bias (see Strobl 2005), that is, predictor variables with a higher number of possible realizations (and thus, a higher number of possible cut points) have a higher probability of being chosen in the  Qi and Zhao (2011), Krüger and Rösch (2017) Variable selection methods: (1) OLS with Backward Elimination (bOLS) tree-growing step. Thus, selecting variables with low importance, that is, with low information content for predicting the LGD, in this way may lead to worse trees with higher estimation errors. To overcome this limitation, we also use a conditional inference tree by Hothorn et al. (2006). This algorithm separates the variable selection process from the splitting procedure. Moreover, regression trees aim to minimize the omitted variable bias, which can be achieved by trees that are grown very deep. However, deep-grown trees tend to overfit the training data, leading to poor out-of-sample estimation accuracy. To overcome these challenges, we also use the random forest algorithm by Breiman (2001). It is a bootstrap aggregation method of de-correlated regression trees that are independently built using random subsets of variables and trained on different parts of the same training set (see, for example, Hastie et al. (2017, chapter 15). After the random forest algorithm has grown an ensemble of regression trees, an average is formed over all regression trees to establish the estimation. Because the trees in a random forest are de-correlated, the method is less prone to overfitting. By averaging across trees, the variance is also reduced, which, in general, leads to a higher estimation accuracy. The use of a random forest for LGD estimation is proposed by, for example, Bastos (2014) and Hurlin et al. (2018). However, in addition to the above-mentioned advantages, the random forest has one decisive disadvantage. When learning imbalanced data (e.g., in a loan portfolio, where most defaulted credits have LGDs close to zero and fewer loans have LGDs close to one), there is a significant probability that the bootstrap samples contain few data of the minority class (i.e., loans with LGDs close to one). This results in biased trees that perform poorly when estimating the minority class (see Chen and Breiman (2004)). Thus, by averaging over all trees, including the biased trees, the estimation accuracy of the random forest can be reduced. Therefore, we also apply boostingbased algorithms, because they focus on incorrectly estimated samples. In a random forest, the trees are built in parallel. In boosting, the trees are built sequentially, and each tries to reduce the bias of the preceding tree. Therefore, using boosting, we build a model in a non-random way that is less susceptible to imbalanced data and makes fewer estimation errors as more trees are added.
For boosting, we use two algorithms. First, we use the adaptive boosting method of Freund and Schapie (1996), which adapts the trees by differently weighting the incorrectly and correctly estimated samples. Second, we use the gradient boosting method of Friedman (2001), which fits each new tree to the residual errors made by the previous tree. The use of boosting methods for LGD estimation is proposed by, for example, Hurlin et al. (2018) and Tanoue and Yamashita (2019). Note that the rule-based methods do not require separate and prior variable selection, because their strategies automatically rank variables by their contribution to the decrease in the estimation error. As an additional rule-based method, we use the cubist regression model by Quinlan (1993). The algorithm uses two ways for improving a modified version of regression tree prediction. First, the cubist model uses a boostinglike framework called "committees" in which iterative model trees are created in sequence to correct for estimation errors. Second, it uses a weighted average of nearest sample neighbors to adjust the predictions.
Another considered advanced method is an artificial neural network, proposed by, for example, Bishop (1995), because it can describe non-relationships in coefficients. Simply put, it is a computational model that consists of several highly interconnected processing elements that process information by their dynamic state response to external inputs. In particular, we use a multilayer perceptron, which consists of a three-layer network (an input layer, a hidden layer, and an output layer). To calculate the artificial neural network, we use a resilient backpropagation algorithm that guarantees an approximation of the estimation value through iterative model updates (see Hastie et al. (2017) for a more detailed description of artificial neural networks). Despite their above-mentioned advantages, artificial neural networks have two disadvantages: Owing to their complexity, they are prone to overfitting, and thus, require intensive hyperparameter tuning, leading to a greater computational cost. Artificial neural networks also have been used for LGD estimation by, for instance, Qi and Zhao (2011).
Additional advanced methods are two different types of vector regressions. More precisely, we consider the support vector regression introduced by Vapnik (1995), and the relevance vector regression by Tipping and Smola (2001). Both methods extend the linear regression by considering relationships that are not linear in the coefficients and are supposed to offer improved accuracy in LGD estimation. The idea of vector regressions is to map the data into a higher dimensional space using a mapping function before performing the linear regression. 6 In the comparative analysis, we choose a radial-basis function kernel for both methods to map the data into a higher dimensional space. Similar to artificial neural networks, vector regressions are prone to overfitting, and thus, need extensive computing requirements for hyperparameter tuning. Nevertheless, some studies illustrate the good predictive performance of vector regressions for LGD estimation (see Yao et al. (2015Yao et al. ( , 2017 and Nazemi et al. (2017)).
We also apply a Gaussian process regression by Williams and Rasmussen (1996), which is already proposed in the context of LGD estimation by, for example, Bellotti et al. (2021), and can be considered as a nonparametric generalization of the relevance vector regression. Instead of calculating the probability distribution of the coefficients of the regression function, the Gaussian process directly imposes a prior (Gaussian) distribution on the functional values. In the present study, we implement the Gaussian process by using a radial basis kernel.
Next, we consider the k-nearest neighbors algorithm, owing to its simplicity in dealing with nonlinear data. The algorithm uses "variable similarities" to estimate the values of any new data point. The new point is assigned a value based on the k-nearest points of a neighborhood in the Euclidean space. Because the k-nearest neighbors algorithm simply chooses the neighbors based on distance criteria, it is highly sensitive to outliers, which can lead to an inferior estimation accuracy. The k-nearest neighbors algorithm has been used for LGD estimation by, for instance, Yang and Tkachenko (2012) and Hartmann-Wendels et al. (2014).
As an additional nonparametric method, we use multivariate adaptive regression splines by Friedman (1991), which is an extension of linear regression. In this method, the training data are first partitioned into separate piecewise linear segments (splines) with different gradients. After partitioning, the splines are connected smoothly together, resulting in a flexible model that can handle both linear and nonlinear relationships between the dependent and independent variables. The use of multivariate adaptive regression splines for LGD estimation is proposed by, for example, Loterman et al. (2012) and Bellotti et al. (2021).
Finally, we apply a finite mixture model by Leisch (2004) owing to its promising results in LGD estimation. The algorithm uses probabilistic clustering and applies an individual linear regression model for each cluster (referred to as component). The use of the finite mixture model in the LGD estimation is proposed by, for example, Krüger and Rösch (2017) and Min et al. (2020).

Clustering and LGD distribution analysis
As noted above, the main motivation for this study is based on two findings from the literature. First, existing LGD studies identify different LGD estimation methods as having the highest estimation accuracy. Second, the LGD distributions in credit portfolios seem to differ by country. This, in particular, may explain the mixed results on the accuracy of LGD estimation methods, as each study uses data of a specific country (or region), which have a specific LGD distribution.
Against this background, we identify types of LGD distributions from an international portfolio of European loans and compare the estimation quality of the methods for each type of LGD distribution. Specifically, we identify country-specific types of LGD distributions from 16 European countries and apply cluster analysis to identify relevant types of distributions. Accordingly, we build country-specific subsamples for the respective distribution clusters (in the present subsection). Subsequently, in Section 3, we compare the methods to identify the LGD estimation method with the highest accuracy for each subsample. The procedure and the results of the cluster analysis are summarized as follows.
For each country in the dataset, we aggregate the LGDs of all defaulted loans based on the LGD quantiles in a range from 1% to 100% with a stepwise increase of 1%. We then cluster the resulting 16 country-specific LGD distributions using the agglomerative hierarchical clustering of Ward (1963) and the k-means clustering of MacQueen (1967). For both approaches, the Euclidean distance was chosen as the distance measure. The final number of clusters is given if the distance between the clusters proposed by the algorithm exceeds a predefined threshold value. Thus, both approaches lead to the same results where the country-specific LGD distributions are assigned to three clusters. 7

3
Heterogeneities among credit risk parameter distributions:… The results are shown in Fig. 1. By aggregating the LGDs from the countries belonging to cluster 1, we see a (nearly) symmetric bimodal LGD distribution, with two extreme events (total losses and total recoveries) being equally likely. If we aggregate the LGDs of all loans from countries belonging to the cluster 2, most of the LGDs characterize (nearly) total losses or total recoveries, also yielding a strong bimodality of the distribution. In contrast to cluster 1, total recoveries in cluster 2 are more likely than total losses (approximately in the ratio 2:1). More precisely, we can identify an asymmetric (positively skewed) bimodal distribution. Finally, if we consider the LGDs from the countries in cluster 3, we find a (positively skewed) unimodal LGD distribution that differs significantly from the LGD distributions in the other two clusters because most of the LGDs characterize total recoveries.
In summary, the cluster analysis identifies heterogeneities among the LGD distributions that differ particularly in terms of their modality. To examine the quality of the clustering result, we apply two different test statistics. Both, the paired t-test and the Mann-Whitney U test confirm the dissimilarity among the three resulting country distributions. 8 Considering the effect of insolvency legislation on LGD, the resulting distribution clusters are economically understandable. According to a study by the European Commission (2016), the time to resolve insolvency and the cost of resolving insolvency is higher in the countries of cluster 1 compared with the countries in clusters 2 and 3. Because an increase in both components (ceteris paribus) implies higher LGDs, it is understandable that the share of high LGDs is higher in countries of cluster 1 than in countries of clusters 2 and 3. This, in turn, explains the second mode ( LGD = 1 ) in the first cluster.

Comparative analysis
In this section, we first introduce the procedure and measures used to compare the predictive performances of the LGD methods. Subsequently, we describe the procedure for determining appropriate hyperparameter values for the competing advanced methods and present the selected values for these methods in each cluster. Finally, we state and discuss the cluster-specific results of the comparative analysis.

Model comparison procedure
First, we split the dataset 9 into a subsample for training (in-sample calibration) and a subsample for testing (out-of-sample estimation), which is a common approach in LGD studies (see Gürtler and Hibbeln (2013) or Hartmann-Wendels et al. (2014)). For robustness, the results should be independent of the specific split of the dataset into training and test data. To achieve this, we split the dataset randomly 10 according to different split ratios. Specifically, we split the datasets according to a (60/40), (70/30), (80/20), and (90/10) training/test split ratio. For each sample split, the LGD estimation methods are in-sample calibrated on the training dataset. The calibrated methods are applied to the test dataset to estimate out-of-sample LGDs. To measure the estimation accuracy, we use three popular out-of-sample criteria (see Hurlin et al. (2018), Krüger and Rösch (2017), or Qi and Zhao (2011)): the mean absolute error (MAE), the mean squared error (MSE), and the coefficient of determination, which are defined as follows: where n corresponds to the number of observations in the respective dataset; LGD i denotes the true LGD value of the i th credit, L GD i,m denotes the corresponding LGD estimation using method m, and LGD corresponds to the arithmetic mean of the LGD frequency and approximated density distributions. Note. The country names are made anonymous because of a non-disclosure agreement true LGD values. Finally, the mean of each criteria calculated over all sample splits denotes the estimation accuracy of the respective LGD estimation method.

Hyperparameter tuning
To provide the best out-of-sample estimation accuracy for each method, it is partially necessary to determine a suitable set of hyperparameter values, a process called hyperparameter tuning. We determine hyperparameter values for the advanced methods, as well as the penalized regressions for each cluster and each sample split, using a five-fold cross-validation (see Nazemi et al. (2017) and Hurlin et al. (2018)) and a random search algorithm. We use the random search algorithm instead of the grid search algorithm because a random search leads to significantly shorter runtimes, and delivers accurate results if the number of iterations is sufficiently high (see Bergstra and Bengio (2012)). The final choice of hyperparameter values in the tuning process is based on the same criteria used in the model comparison procedure.
The hyperparameter tuning process is based on the respective training subsamples and can be described as follows. We first separate the respective training dataset into five subsamples. Of the five subsamples, four are used for (in-sample) calibration, while the remaining set is used for out-of-sample testing. This procedure is carried out five times with a changing test dataset. Within this process, the random search algorithm trains the considered LGD estimation method based on 1,000 different hyperparameter settings, where the hyperparameters are chosen randomly from a predefined hyperparameter set. The number of hyperparameter settings follows Bergstra and Bengio (2012). Finally, the random search algorithm chooses the hyperparameter values with the highest estimation accuracy (e.g., the smallest MSE on the test set). For each estimation method and cluster, the chosen hyperparameters, corresponding sets, and final choice of hyperparameter values are given in Table 3. For clarity, we limit the presentation of the hyperparameter tuning results in each cluster to the (70/30) sample split (with the MSE as evaluation criterion), which is often used in other LGD studies (see Qi and Zhao (2011) and Gürtler and Hibbeln (2013)). 11 Below, we briefly summarize the main results.
First, for the lasso and elastic regressions, we find the same optimal hyperparameter values ( = 0.001 in cluster 1 and = 0.0001 in clusters 2 and 3), whereas the ridge regression deviates significantly from these values = 100 in each cluster).
Second, for the rule-based methods, the results are as follows. For the regression tree, the trees in each cluster are similar in terms of size (6-7), while the tree in cluster 2 has a lower minimum "node size" (9 instead of 14 or 17). For the conditional inference tree and random forest, the tuning process shows that the number of splitting variables is similar in each cluster. The random forest also builds higher tree sizes than those of the regression tree in each cluster (9 instead of 6 or 7) and

3
Heterogeneities among credit risk parameter distributions:…  Note. The names of the chosen hyperparameters and a description of the estimation methods can be found in Hastie et al. (2017) produces a high number of trees (777-873). Compared with the random forest, both boosting methods build smaller trees in each cluster ("tree size" ∈ {4, 5, 6} instead of "tree size" = 9 ) and produce distinctly fewer trees (112-168 instead of 777-873). For the boosting methods, we use a constant learning rate that corresponds to the speed with which the error is corrected from each tree to the next. A high learning rate requires a lower number of trees, and, conversely, a low learning rate requires a higher number of trees. As we test the number of trees for a predefined set, a constant learning rate is appropriate. The trees in the cubist regression model are (nearly) similar in clusters 2 and 3, while the first cluster produces a lower number of committees (23 instead of 62 and 64). Third, for the artificial neural network, in each cluster, the second 12 hidden layer is eliminated and the number of neurons in the first hidden layer varies from five to six. In each cluster, the logistic function is preferred as the activation function. As a solver for the weight optimization, we use the stochastic gradient-based optimizer (with a learning rate of 0.001) proposed by Kingma and Ba (2014) because it is recommended for large datasets.
Fourth, in the support vector regression, the cost parameter value is the same for each cluster, while the number of selected support vectors increases from cluster 1 to cluster 3. For the relevance vector regression and the Gaussian process regression, the inverse kernel width increases from cluster 1 to 3. As a kernel function, we use the radial basis function for each of the three last-mentioned methods because of its good overall performance for vector machines (see Baesens et al. (2000)).
Fifth, in the k-nearest neighbors method, the number of nearest neighbors in cluster 3 ( k = 11 ) deviates from those of clusters 1 and 2 ( k ∈ {20, 22}).
Sixth, for the multivariate adaptive regression splines, both tuning parameters (maximum degree of input parameters and number of terms to retain in the final regression function) increase from cluster 1 to cluster 2 and decrease to cluster 3.
Finally, in the finite mixture model, the number of mixture components are identical in each cluster.

Out-of-sample results
In this subsection, we present the results of our comparative analysis. Because the cluster-specific best estimation methods are identical for all selected performance measures, we only present the results based on the MSE and MAE in detail for reasons of clarity. 13 Tables 4 and 5 show the out-of-sample estimation accuracies of the LGD methods. The resulting MSEs and MAEs are shown separately for the different clusters and split ratios. The final assessment of the methods is based on their mean MSE and mean MAE, respectively.
In cluster 1, where the LGDs are symmetrically bimodally distributed, the results can be summarized as follows. First, the traditional methods are similar in terms of their mean MSE, with the ordinary least squares regression having the lowest (0.1331) and the fractional logit regression leading to the highest mean MSE (0.1335). In terms of their mean MAEs, the results are comparable with lasso regression showing the worst performance (0.3278). Second, not all advanced methods outperform the traditional methods. In particular, the k-nearest neighbors method and the artificial neural network exhibit relatively weak performance, with a mean MSE of 0.1431 and 0.1336, respectively. Considering the mean MAE, the neural network outperforms the traditional methods, with a mean MAE of 0.3107, and also shows better performance compared to other advanced methods such as the Gaussian process regression (0.3192) and the conditional inference tree (0.3160). Again, the k-nearest neighbors method performs worst with a mean MAE of 0.3343. Furthermore, the traditional methods have the highest mean MSE and mean MAE for the 70/30 sample split, whereas the performance of the rule-based methods improves with the size of the training sample. Each model extension of the simple regression tree also shows an improvement in MSE and MAE (decrease in MSE and MAE by at least 0.0094 and 0.0107, respectively). Finally, the decisive result is that the random forest distinctly outperforms the other methods -even in each sample split -with a mean MSE of 0.1241 and a mean MAE of 0.3067. Considering the mean MSE, it is followed by the gradient boosting method (0.1305) and support vector regression (0.1322). For the mean MAE, the cubist regression model (0.3068) and the relevance vector regression (0.3095) are the next best methods.
In cluster 2, where the LGDs follow an asymmetric (positively skewed) bimodal LGD distribution, the results are slightly different. First, considering the mean MSE, all of the traditional LGD estimation methods show lower estimation accuracies than in cluster 1. On average, the mean MSE has increased by 0.0041 for the traditional methods. In contrast, for the advanced methods, the mean MSE increased by 0.0014 on average. Considering the mean MAEs, the results are different. Most of the methods (except adaptive boosting method and finite mixture model) show higher estimation accuracies than in cluster 1. Second, while the fractional logit regression performs poorly in cluster 1 for both criteria, it outperforms the other traditional methods in cluster 2. Third, most of the advanced methods (except the adaptive boosting method) outperform the traditional methods, and the performance of each method increases with the size of the training sample. Finally, the gradient boosting method shows the lowest MSEs and MAEs for each sample split and leads to the lowest mean MSE of 0.1276 and the lowest mean MAE of 0.2712. For the mean MSE, it is followed by the random forest (0.1319) and the Gaussian process regression (0.1329). For the mean MAE, the cubist regression model and the support vector regression perform second best and third best, respectively.
In cluster 3, the case of (positively skewed) unimodally distributed LGDs, the MSEs of the advanced methods have evidently been reduced by about half, and the mean MSE ranges from 0.0455 to 0.0788. In contrast, the traditional methods show a reduction in the MSEs of about one third and the mean MSE ranges from 0.0807 to 0.0840. Obviously, the MAEs show equivalent results. That is, all methods can handle unimodal distributions better than bimodal distributions. While the ordinary least squares regression proves to be the best of the traditional methods (as in cluster 1) for the MSEs, it is lasso regression for the MAEs. The artificial neural network  Heterogeneities among credit risk parameter distributions:…  and the adaptive boosting method perform worst for both criteria. In accordance with clusters 1 and 2, the random forest and the gradient boosting show good performance, while their absolute difference in mean MSE is slightly less than in the other two clusters. The decisive result is that the finite mixture model unmistakably outperforms the other methods-even in each sample split-with a mean MSE of 0.0455 and a mean MAE of 0.1480. For the mean MSE, it is followed by the random forest (0.0561) and the gradient boosting method (0.0572). For the mean MAE, the random forest (0.1488) and the cubist regression model (0.1552) are the next best methods.
To exclude the possibility that some superiority may have occurred by chance, we also perform a paired t-test in each cluster to compare the mean MSE and mean MAE of the five best methods (similar to Yao et al. (2017) and Hurlin et al. (2018)). 14 The key insights of the pairwise tests are as follows. Considering cluster 1, the differences in mean MSE between the random forest and the next best methods are always negative at the 5% significance level. In contrast, the differences in the mean MAE are negative at the 10% significance level. That is, the random forest shows (marginal) significant superiority. In the context of clusters 2 and 3, the two best-performing methods (i.e., the gradient boosting method and the finite mixture model) outperform all other models significantly. Precisely, the differences between the corresponding mean MSEs and mean MAEs are always negative at the 5% significance level.
Broadly, the results indicate that the advanced methods outperform the traditional methods overall. However, the relatively weak performance of the artificial neural network shows that, even with a systematic choice of hyperparameters, overfitting remains a challenging issue when applying advanced methods to an LGD estimation. Further, the level of estimation accuracy is related to the respective LGD distribution. For bimodal distributions, all methods show considerably worse performance than for unimodal distributions. This result is understandable because the methods (mostly) correctly estimate a low LGD for a unimodal distribution based on the randomly drawn training dataset. In the case of a bimodal distribution with two different modes, the estimation is discernibly more difficult. Moreover, the type of distribution is crucial for the best-performing method. For symmetrically bimodally distributed LGDs, the random forest implies the highest estimation accuracy and the paired t-test shows its significant superiority compared with the other next best methods. For the asymmetric bimodal LGD distribution, the gradient boosting method shows the best performance, which is also significant at the 5% level. This result is understandable and can be explained as follows. A central difference between the random forest and gradient boosting method is the simultaneous or iterative construction of individual trees. Because of the high probability of small LGDs in the case of asymmetric bimodal distributions, the random forest creates a high proportion of trees that belong to low LGDs. Here, the "learning effect" of the method is missing because of its simultaneous structure. In contrast, the iterative approach of gradient boosting leads to better identification of high LGDs, even if high LGDs only have a small proportion. That is, this method has advantages for asymmetrical bimodal distributions. Finally, in the case of a unimodal LGD distribution, the finite mixture model shows the highest estimation accuracy, which is also significant compared to the other methods. Because of the simple nature of the distribution, the finite mixture model can easily identify suitable components for which the separate linear regressions then lead to good estimation results. Although some other methods also use partitioning strategies, the finite mixture model seems to be particularly suitable for unimodal LGD distributions.

Robustness checks
To investigate the robustness of the performance results, we consider four modifications of the estimation approach: First, we extend the methods by including additional explanatory variables. Second, we change the clustering procedure by clustering the LGD distributions based on a loan-specific variable rather than country. Third, we apply a logarithmic transformation to the positively skewed unimodally distributed LGDs to get a more normal-like distribution. Fourth, we change the dataset and use non-European credit portfolios that are characterized by the same three types of LGD distributions.
In each robustness check, we rerun our method comparison procedure, that is, the methods are re-calibrated, optimal hyperparameter values are re-determined and the out-of-sample model comparisons and significance tests are re-performed. Before we present the detailed results, it can already be stated at this point that the best methods remain the same for the three distribution types, regardless of the performance measure. For this reason, we show only the results based on the mean MSE for reasons of clarity. 15

Inclusion of enterprise-specific variables
In this subsection, we test how additional explanatory variables affect the estimation results. Specifically, we include the following three enterprise-specific (logarithmic) variables: the reported sales in the 12-month period before default, the reported total assets on default, and the total amount of interest-bearing debt. Because of a non-disclosure agreement, this information is not available for all enterprises in the credit portfolio. For this reason, the robustness check is based on a reduced (but sufficiently large) dataset of 4,268 defaulted loans. The LGD distributions of the loans are still characterized by the three distribution types. 16 Table 6 shows the results in terms of the MSEs . It is noteworthy that the estimation errors are reduced for each method in each cluster, that is, the newly added variables seem to be important for the estimation performances of the methods. For this reason, we exemplary analyze the variable importance of the random forest for the (nearly) symmetric bimodal distribution (cluster 1). The importance score of a variable is computed as the total reduction of the node impurity brought by that variable, averaged over all trees in the random forest. In this study, the decrease in node impurity is measured based on the difference between the MSE before and after splitting on a certain variable. The higher the impurity decrease, the higher the importance score of the respective variable as it indicates a higher contribution to reducing the MSE. Figure 2 shows the relative variable importance 17 for each variable in the random forest. It turns out that the additional three variables (represented by dark bars) have a high importance score in the random forest, confirming the relevance of including enterprisespecific variables in the LGD estimation. However, because the inclusion of additional variables usually also carries the risk of overfitting, we would like to point out that such an approach does not necessarily lead to better estimation results in other estimation tasks.
As already mentioned, the main results of this robustness check are similar to those in the preceding subsection. First, for bimodal distributions, all methods show worse performances than for the unimodal distribution. Second, for symmetrically bimodally distributed LGDs, the random forest implies the highest estimation accuracy. Third, the gradient boosting method shows the best performance for the asymmetric bimodal distribution, followed by Gaussian process regression and random forest. Fourth, in the case of a unimodal distribution, the finite mixture model turns out to be best. For all best-performing methods, the paired t-tests confirm their significant superiorities. Therefore, we confirm that the level of estimation accuracy is related to the respective distribution type.

Clustering based on loan-specific variable
A key finding of our study is that the specific modality type of a distribution is crucial for the best-performing estimation method. To rule out that the identified heterogeneities among the distributions are not caused by the approach of clustering, in this robustness check we do not cluster the distributions by country, but by a loan-specific variable. Specifically, we use the number of collaterals deposited for a loan, which has emerged in the literature as one of the most important loan-specific variables for estimating LGDs (see, for instance, Dermine and de Carvalho (2006) or Krüger and Rösch (2017)). Moreover, the analysis of the variable importance of the random forest in the previous subsection also indicates the high relevance of this variable (see Fig. 2).
The clustering strategy is same as in Subsection 2.3: For each number of collateral in the dataset, 18 we aggregate the LGDs of all defaulted loans based on the LGD quantiles in a range from 1% to 100% with a stepwise increase of 1%. We then cluster the resulting ten loan-specific LGD distributions using the agglomerative hierarchical clustering. The results are shown in Fig. 3. 19 It turns out that clustering by a 1 3 Heterogeneities among credit risk parameter distributions:… Table 6 Inclusion of enterprise-specific variables: Out-of-sample estimation accuracies ( MSE) Note. This table reports  loan-specific variable does not lead to any other specific distribution types. Again, we find three clusters, whose distributions essentially differ in their modality type. Table 7 shows the out-of-sample estimation accuracies of the LGD methods. Overall, the results are comparable to those from Subsection 3.3 and can be summarized as follows: First, most of the advanced methods outperform the traditional methods for the three distribution types. Second, the performance of each method increases in each cluster with the size of the training sample. Third, all methods deal better with unimodal than with bimodal distributions, confirming that the level of estimation accuracy is related to the respective distribution. Finally, the superior methods are the same for each cluster, which confirms that the specific distribution type is crucial for the best-performing method.
This robustness check shows that the identified heterogeneities among the distributions even persist when the clustering approach is modified. Of course, it is conceivable that other distributions are relevant in a clustering approach based on other variables such as macroeconomic variables. In such a case, the best estimation procedure must be redetermined.

Logarithmic transformation of the positively skewed unimodally distributed
LGDs In this subsection, we apply a logarithmic transformation to the positively skewed unimodally distributed LGDs (country-specific LGD distribution; cluster 3), which leads to a more normal-like distribution. 20 We investigate whether this approach leads to improved estimation results or change the conclusions regarding the accuracies of the methods. Table 8 shows the estimation performance of the methods in terms of the MSEs. The main results are similar to those for cluster 3 in Subsection 3.3. We find that the estimation errors of all methods are reduced due to the simple nature of the normallike distribution. In accordance with the previous results, the random forest and the gradient boosting show good performance. However, the finite mixture model outperforms significantly the other methods in each sample split. Therefore, this robustness check provides two new insights: First, transforming the unimodal distribution improves the performance of the estimation methods. Second, the finite mixture model also seems to be particularly suitable for more normal-like distributions.

Non-European credit portfolios
In this subsection, we conduct a comparative analysis based on various non-European credit portfolios as a robustness check. Using 6,408 defaulted loans by Latin American, North American, and Oceanian SMEs, provided by Global Credit Data, we rerun our method comparison procedure. The restrictions we applied to the data are the same as those for the European data. 21 The LGD distributions of the defaulted loans are shown in Fig. OA.4 in online appendix and are characterized by the distribution types identified previously. While the LGDs in the Latin American loan and the North American loan portfolios show a symmetric or asymmetric (positively skewed) bimodal distribution (clusters 1 and 2), the Oceanian LGDs are characterized by a (positively skewed) unimodal distribution shape (cluster 3). The LGD estimation methods are evaluated based on their out-of-sample performances. Table 9 shows the results in terms of the MSEs. 22 The main results are similar to those for the clusters in the preceding subsections.
First, for bimodal distributions, all methods show worse performances than for the unimodal distribution. Second, for symmetrically bimodally distributed LGDs, the random forest implies the highest estimation accuracy. Third, the gradient boosting method shows the best performance for the asymmetric bimodal distribution, followed by the random forest and support vector regression. Fourth, in the case of a unimodal distribution, the finite mixture model turns out to be best. Therefore, we confirm that the level of estimation accuracy is related to the respective LGD 22 Again we also perform an analysis based on the R 2 and MAE. The results are available on request. 20 The descriptive statistics are shown in Table OA.11 in online appendix. 21 The descriptive statistics are shown in Table OA.13 in online appendix. distribution type. It is also noteworthy that these results persist even with small loan portfolios such as the Latin American or the Oceanian portfolio.

Conclusion
The literature reveals mixed results on how well LGD estimation methods perform owing to each study using country-specific credit data that have a specific LGD distribution. In contrast, we compare various LGD estimation methods for a large class of LGD distributions. For a broad international loan portfolio, we first identify relevant types of LGD distributions by using cluster analysis and then compare the estimation methods individually for each LGD distribution type.
The cluster analysis leads to three types of distributions, which differ in their modality. We identify a (nearly) symmetric bimodal distribution, an asymmetric (positively skewed) bimodal distribution, and a (positively skewed) unimodal distribution. The estimation accuracies of 20 different methods are tested based on their out-of-sample performance, measured using MSE, MAE, and R 2 . First, for loan portfolios with a symmetric bimodal LGD distribution, the random forest implies the highest estimation accuracy and should be preferred to other methods in an LGD estimation. Second, LGD estimations for loan portfolios with asymmetrically (positively skewed) bimodally distributed LGDs LGD LGD No. collaterals = 4 Frequency Density No. collaterals = 5 Frequency Density LGD

Frequency Density
No. collaterals = 9 LGD LGD LGD LGD Fig. 3 Clustering based on loan-specific variable: LGD frequency and approximated density distributions Table 7 Clustering based on loan-specific variable: Out-of-sample estimation accuracies ( MSE) Note. This table reports Table 9 Non-European credit portfolios: Out-of-sample estimation accuracies ( MSE) Note. This table reports  should be based on the gradient boosting method. Finally, in the case of a unimodal LGD distribution, the finite mixture model shows the best performance. The latter results are supported by a series of robustness checks. This study makes two main contributions to the literature on LGD estimation. On the one hand, we show that different country-specific LGD distributions can be traced to three basic (modality) types, which determine the estimation method to be used. On the other hand, we identify methods that perform best, depending on the modality of the LGD distribution. These results provide general advice for banking practice and regulatory authorities. Instead of an extensive loan portfolio analysis, we recommend that only the LGD distribution type needs to be identified to select the best-performing estimation method.
Furthermore, our study also has relevance for forecasting and estimation problems outside the banking area, because the idea of clustering and identifying different parameter distribution types to determine the respective best estimation procedure is applicable in all areas of predictive analytics. In this way, we obtain a distribution-type-dependent recommendation for method selection. Of course, in case of additional identified distribution types, the performance measurement of the estimation procedures has to be repeated.
Funding Open Access funding enabled and organized by Projekt DEAL. No funds, grants, or other financial support was received.
Data Availability Non-disclosure agreement with Global Credit Data (GCD). GCD is a non-profit association. See https:// www. globa lcred itdata. org/ for further information.
Code Availability The programming code in Python of the applied traditional and advanced methods is available.

Conflicts of interest/Competing interests
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.