1 Introduction

1.1 Framework

In recent years, the field of artificial intelligence (AI) has experienced an explosive growth in academic research, resulting in the development of increasingly complex AI systems used in high-stakes applications. As a result, demand has grown for AI models that can provide clear and interpretable explanations of their decision-making processes, leading to the emergence of explainable AI (XAI) as a hot topic in AI research [1]. XAI can be achieved through two main approaches: utilizing models that are intrinsically explainable, such as generalized linear models (GLM) or generalized additive models (GAM), or applying a set of techniques to provide ex-post transparency and understanding of how black-box models arrive at their decisions.

As the drive for explainability continues, concerns are growing about the potential impact of inherent explainability on model performance. This trade-off [2, 3] between performance and explainability means that the most performant models are often less explainable (black-boxes) than intrinsically explainable but less performant ones (glass-boxes). Credit scoring is no exception to this, given the growing role of AI and the need for interpretability. In this article, we explore and discuss the potential cost of explainability as a trade-off that could reduce model performance when using inherently explainable models.

Credit Scoring with AI.

Credit scoring is a key application of artificial intelligence that aims to predict the creditworthiness of actual or potential debtors based on various features. These features can be directly linked to the debtor or to external elements such as the economic environment. Credit scoring models attribute a score or a grade that reflects the default probability of a borrower over a time horizon, usually a year (one-year PD) or the maturity of the credit exposure (lifetime PD). They are used for underwriting and pricing credit, managing credit risk, or computing capital requirements for financial institutions under regulations like Basel IV or Solvency II.

The importance of credit risk within financial institutions has spurred the development of numerous credit scoring models. In recent years, there has been a growing interest in using AI for its ability to capture complex relationships between variables and improve prediction accuracy. Several systematic reviews, including [4,5,6,7,8,9,10,11,12,13,14], provide a comprehensive analysis of AI methods for credit scoring. Additionally, [7] offers a review of the main models and of the articles that present them.

Interpretable or Explainable.

Explainability is a crucial consideration in credit scoring models, given their commercial significance for underwriting, pricing, and solvency computation. Internal rating-based (IRB) models must comply with strict constraints, which insurers must follow as per the Solvency II Commission Delegated Regulation (EU) 2015/35 of 10 October 2014, and banks as per the EU Delegated Regulation 2022/439 supplementing the Capital Requirements Regulation (CRR) with regard to Regulatory Technical Standards (RTS) for using the Internal Ratings Based Approach. One such constraint is the interpretability of models, which is mandated by regulatory requirements. These requirements have limited the use of advanced machine learning algorithms for IRB models, as highlighted in the EBA discussion [15].

AI-based credit scoring models can either be intrinsically interpretable or benefit from post-hoc techniques to enhance their understandability. Models that are intrinsically interpretable are often referred to as “white-box” or “glass-box” models, such as generalized linear models (GLMs) or generalized additive models (GAMs). In [16], GAMs, GLMs, and their respective strengths are discussed in detail, and the statistical cost of misclassification is assessed when using GAMs that capture non-linear relationships, as compared to GLMs. [17] investigates the interpretability and trustworthiness of GAMs through quantitative and qualitative analyses and generally confirms their inherent interpretability.

There are also models in credit scoring that are referred to as “black-box”, meaning that they lack intrinsic interpretability and require ex-post techniques to improve their interpretability. A detailed review of the main methods can be found in [18]. In addition, [19] provides an updated taxonomy of explainable AI, which makes explicit the difference between “global interpretability” (the whole logic of a model and the reasoning leading to all possible outcomes) and “local interpretability” of a single prediction. It differentiates between intrinsically interpretable models and non-intrinsically interpretable models that can benefit from interpretation techniques such as Shapley values, local interpretable model-agnostic explanations (LIME), partial dependence plots (PDP), individual conditional expectation plots (ICE), attention maps, and others. Although these interpretation techniques have their respective strengths and weaknesses, a detailed description of them is beyond the scope of this research. Interested readers can refer to [18,19,20,21,22,23] for more information.

It is worth noting that recent developments such as [3, 24,25,26] attempt to overcome some limitations of traditional interpretation methods and to come closer to the level of interpretability of inherently interpretable models.

In the remaining sections of this work, we will use the term “explainable model” to refer to a model that is intrinsically or inherently interpretable, and the term “interpretable model” to refer to a black-box model that can be interpreted ex-post using appropriate techniques.

In recent years, there has been a growing interest in developing algorithms that combine the performance of black-box techniques like Gradient Boosting [27, 28] or neural networks [29, 30], with the explainability of generalized linear or additive models (GAMs). These hybrid models aim to strike a balance between interpretability and performance. They have shown promising results.

Measuring Credit Scoring Performance.

The assessment of credit risk models involves a wide range of evaluation metrics, as summarized by [31]. The most used statistical measures include AUROC and F-measures derived from the confusion matrix.

However, the effectiveness of statistical metrics in capturing financial results has been questioned in a different context by [32], suggesting that relying solely on such statistical measures may not be sufficient. While most studies use statistical metrics to evaluate credit risk models, only a few have applied some sort of financial measurement. For example, [33] and [34] compute a possible loss, assuming a loss given default (LGD) of 100%, without considering funding costs for the LGD computation. [35] measures the expected loan profitability instead of default risk, using the loan lifetime internal rate of return (IRR), but does not account for LGD. [36] proposes a financial approach that considers LGD for the principal amount only but does not incorporate the link between predicted PD and coupon rate.

In addition to assessing the effectiveness of credit scoring, financial regulation requires that predictive ability, discriminatory power, and stability of model outputs meet certain standards, particularly for IRB models used in solvency computations. Although the banking framework (footnote 1) provides detailed guidelines for model acceptability, it has received relatively little attention in academic research. While [37] refers to this regulatory framework, it does not measure compliance with the regulatory rules.

1.2 Original Contribution

Our paper makes a threefold contribution. First, we compare the standard AI models, both the explainable and the black-box ones, with two newly developed explainable models that use black-box techniques to produce a GAM, which aim at balancing performance and interpretability.

Second, we evaluate the models for their regulatory compliance and financial performance, considering the lender’s credit risk appetite, funding costs, predicted PD, and LGD. Third, we introduce the concept of “cost of explainability” to measure the financial cost of using inherently explainable models. To this end, we compare their achieved economic performance with that of the most performant black-box models.

The rest of the paper is organized as follows: Sect. 2 describes the dataset and data management. Section 3 presents the methodology, including the models’ construction, evaluation metrics, and financial measures. Section 4 reports the empirical results, discusses the models’ performance, regulatory compliance, and financial cost. Finally, Sect. 5 concludes and provides some perspectives for future research.

2 Data

2.1 Description

Our study evaluates various credit scoring models using a dataset supplied by the world leader in trade credit insurance, covering the period from 2019 to 2022. The dataset consists of an anonymized, limited, and representative set of one-year credit exposures on a group of European borrowers reporting legal and financial information, with total assets above 1 million euros. In total, there are 55 explanatory variables, including 14 balance sheet items, 4 income statement figures, 9 financial ratios, and 9 legal and company descriptive features, plus one-year lagged financial information. For confidentiality reasons, we cannot disclose details on the explanatory variables.

We use a training set of 2019 data (76089 rows), and test sets from 2020 to 2022, with 44151, 61406, and 59074 rows, respectively.

2.2 Data Management

Missing Data. The dataset we use for our analysis contains missing values. While some machine learning algorithms can handle missing data, most cannot. Therefore, we test several imputation techniques on the training data to handle the missing values. For univariate imputation, we replace missing values with the median values from the training set. For multivariate imputation, we use KNN, as introduced by [38], which uses the k nearest neighbors to determine the imputation value. We also employ a Bayesian Ridge algorithm with a round-robin iterative process, as described in [39]. We analyze the data drift induced by the imputations in the training data using four different tests: the Population Stability Index (PSI), Kolmogorov-Smirnov (KS), Kullback-Leibler divergence (KL), and Jensen-Shannon divergence (JS) tests. Based on the test results, we decide to use the KNN imputation method with k set to 4.
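As an illustration of the drift analysis, the PSI between the binned distribution of a feature before and after imputation can be sketched as follows. This is a minimal sketch, not the implementation used in the study: the binning, the bin counts, and the small floor `eps` that avoids taking the logarithm of zero are assumptions made for the example.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    PSI = sum over bins of (q - p) * ln(q / p), where p and q are the
    bin shares of the reference and of the compared distribution.
    """
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        p = max(e / exp_total, eps)  # floor avoids log(0) for empty bins
        q = max(a / act_total, eps)
        total += (q - p) * math.log(q / p)
    return total

# Hypothetical bin counts of one feature before and after imputation.
before_imputation = [120, 300, 380, 150, 50]
after_imputation = [118, 305, 379, 149, 49]
print(psi(before_imputation, after_imputation))
```

A value close to zero indicates that the imputation barely shifts the distribution of the feature; the acceptance threshold is a modelling choice that is not specified here.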

Data Quality.

To improve the quality of our credit scoring models, we apply several data preprocessing techniques.

Imbalance Management:

First, we address the class imbalance issue by including all 1,346 defaults from 2017–2018 data in our 2019 training set.

Outliers Management:

Second, we apply a 0.5%-99.5% univariate winsorization to financial variables to remove outliers that could negatively impact model performance.
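The winsorization step can be sketched as below. The sketch assumes that the 0.5% and 99.5% bounds are estimated on the training set and then reused for the other sets, and it uses a simple linearly interpolated percentile; both choices are assumptions, since the text does not specify them.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def fit_winsor_bounds(train_values, lower_q=0.5, upper_q=99.5):
    """Estimate the clipping bounds on the training data only."""
    ordered = sorted(train_values)
    return percentile(ordered, lower_q), percentile(ordered, upper_q)

def winsorize(values, bounds):
    """Clip every value to the [lower, upper] bounds."""
    lower, upper = bounds
    return [min(max(v, lower), upper) for v in values]

# Hypothetical feature values: bounds come from the training set.
train = [float(v) for v in range(1001)]
bounds = fit_winsor_bounds(train)
print(winsorize([-10.0, 500.0, 2000.0], bounds))
```

Note that winsorization clips extreme values to the bounds rather than dropping the corresponding rows, so the number of observations is unchanged.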

Features Selection:

Finally, we conduct a feature selection to reduce the dimensionality of the data. Starting from the original set of 55 features, we remove 20 highly correlated variables and end up with a set of 35 features: 16 balance sheet or cash features (including their 1-year evolution), 5 income statement features (including their evolution), 5 financial ratios, and 9 descriptive, administrative or legal features.

Standardization or Normalization:

To ensure consistency in the analysis of the financial features, we apply either standardization or min-max normalization. We compute the standardization and normalization parameters for each financial feature on the training set and apply them to standardize or normalize the training, validation and test sets. The logistic regression, elastic-net regression, naïve Bayes, linear discriminant analysis, random forest, and support vector machine models use standardized features; the other models use min-max normalization. The choice between standardization and normalization follows the most common practice for each algorithm. These models are described in Sect. 3.
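The key point of this step is that the scaling parameters are estimated on the training set only and then reused unchanged on the other sets. A minimal sketch for a single feature is given below; the function names and the use of the population standard deviation are assumptions made for the example.

```python
import statistics

def fit_standardizer(train_values):
    """Mean and standard deviation estimated on the training set only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def standardize(values, params):
    mean, std = params
    return [(v - mean) / std for v in values]

def fit_minmax(train_values):
    """Minimum and maximum estimated on the training set only."""
    return min(train_values), max(train_values)

def minmax_normalize(values, params):
    lowest, highest = params
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical values of one financial feature.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [1.0, 5.0, 6.0]
print(standardize(test, fit_standardizer(train)))
print(minmax_normalize(test, fit_minmax(train)))
```

Because the parameters come from the training data, a test value outside the training range maps outside [0, 1] under min-max normalization, as the last test value does here.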

3 Methodology

3.1 Models

Models Tested. In this section, we describe the credit scoring models that we test for their predictive ability using explanatory variables directly linked to the debtor or external elements such as the economic environment.

We examine twelve different models, which are detailed in [4,5,6,7,8,9,10,11,12,13,14].

Of these models, four are considered inherently ‘explainable’: Logistic Regression (LR), Elastic Net applied to a logistic regression (ELN), Naïve Bayes (NB), and Linear Discriminant Analysis (LDA).

Six models are classified as black-boxes interpretable ex-post: Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting (GB), Light GBM (LGBM), eXtreme gradient Boosting (XGB), and Multi-Layer Perceptron (MLP).

In addition to these models, we also test two recent explainable Generalized Additive Models (GAM), which we add to the group of inherently explainable models. One model is powered by gradient boosting: the Explainable Boosting Machine (EBM) [27, 28]. The other model is based on neural networks: Gami-net (GAMI) [30, 40]. Both models have a GAM form:

$$g\left(E\left[y\right]\right)={\beta }_{0}+\sum_{j}{f}_{j}\left({x}_{j}\right) \tag{1}$$

where \({f}_{j}\) is the shape function learned for each explanatory feature \({x}_{j}\). EBM uses a gradient boosting technique: to learn the feature functions and their respective contributions to the model’s predictive capacity, the boosting procedure is restricted to train on one feature at a time in a “round-robin cycle”. Gami-net is a GAM of the same form as EBM that relies on neural networks rather than gradient boosting to learn each feature function \({f}_{j}\).
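To make the additive structure of Eq. (1) concrete, the sketch below scores one borrower from already learned shape functions. It assumes a logit link for g and piecewise-constant shape functions stored as lookup tables; the feature names, bin edges, and scores are hypothetical and are not taken from the trained models.

```python
import bisect
import math

def shape_value(x, edges, scores):
    """Piecewise-constant shape function: scores[k] applies to the k-th bin.

    `edges` holds the K-1 inner bin boundaries and `scores` the K bin values.
    """
    return scores[bisect.bisect_right(edges, x)]

def gam_predict(row, intercept, shapes):
    """Return the predicted PD and the per-feature contributions."""
    contributions = {
        name: shape_value(row[name], edges, scores)
        for name, (edges, scores) in shapes.items()
    }
    logit = intercept + sum(contributions.values())
    pd_hat = 1.0 / (1.0 + math.exp(-logit))  # inverse of the assumed logit link
    return pd_hat, contributions

# Hypothetical shape functions for two features.
shapes = {
    "leverage": ([0.5, 2.0], [-0.6, 0.0, 0.9]),
    "retained_earnings": ([0.0], [0.7, -0.4]),
}
pd_hat, parts = gam_predict({"leverage": 2.5, "retained_earnings": 1.2}, -3.0, shapes)
print(pd_hat, parts)
```

Since the prediction is a sum of one-dimensional terms, each contribution can be read and plotted on its own, which is what makes models of this form inherently explainable.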

Both models allow pairwise interactions. However, for this exercise, we do not consider pairwise interactions to keep the models sparse and avoid heredity issues related to the hierarchical structure between the main effects and the interactions. As such, we maximize their intelligibility and inherent explainability.

It is worth noting that we do not consider convolutional neural networks (CNN) or long short-term memory (LSTM) models due to the nature of the available data and the lack of deep historical information, although these models are typically good performers for this type of task.

GAM Isotonicity.

Isotonicity can be incorporated in GAM (footnote 2) models such as GAMI or EBM by constraining some or all of the \({f}_{j}\) in Eq. (1) to be increasing or decreasing, on their entire range or on part of it. This approach has several benefits. By enforcing isotonicity, outliers can be managed effectively, thereby reducing or excluding their impact on the model’s predictions. To illustrate this, consider a feature like the debt-to-equity ratio (D/E) in a GAM. Expert judgment suggests that lower D/E ratios correspond to lower default risk. If the training set includes one or more defaulted companies with a specific low D/E ratio, the value of \({f}_{D/E}\) for the corresponding bin might be abnormally high, which would distort the model’s predictive ability. By constraining \({f}_{D/E}\) to be monotonic, so that its contribution to the default risk falls as D/E falls, the model’s predictive power can be improved by addressing the outlier issue in the training set.

Not only does isotonicity help manage outliers, but it can also enhance the explainability of GAMI or EBM models by aligning each \({f}_{j}\) more closely with expert judgment. This can improve the models’ interpretability and facilitate effective human supervision. Relying on standard expert judgment in credit risk, we test EBM Isotonic and GAMI Isotonic, forcing 10 of the 35 features of the trained models to be isotonic. It is worth noting that the expert judgment only imposes the direction (increasing or decreasing) of \({f}_{j}\), not the values taken by \({f}_{j}\). Introducing expert judgment to obtain an isotonic GAM increases the interpretability of the model but might come at a cost in terms of accuracy or financial performance.
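One generic way to obtain a monotone shape function is to project the binned scores of a trained feature function onto the closest non-decreasing sequence with the pool-adjacent-violators algorithm. The sketch below illustrates that technique only; it is not claimed to be the mechanism used by EBM or Gami-net to enforce isotonicity, and the bin scores and weights are hypothetical.

```python
def pava_increasing(scores, weights):
    """Weighted least-squares projection onto non-decreasing sequences."""
    blocks = []  # each block: [weighted mean, total weight, number of bins]
    for score, weight in zip(scores, weights):
        blocks.append([score, weight, 1])
        # Merge backwards while two adjacent blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    result = []
    for mean, _, n_bins in blocks:
        result.extend([mean] * n_bins)
    return result

# Hypothetical bin scores: the second bin holds few observations and an
# abnormally high score, as would happen with an isolated default.
scores = [-0.8, 0.9, -0.4, 0.1, 0.6]
weights = [50, 2, 60, 40, 30]
print(pava_increasing(scores, weights))
```

The sparsely populated bin is pooled with its neighbor, so the isolated spike no longer dominates the shape function. A decreasing constraint can be obtained by applying the same function to the negated scores.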

Hyper-parameters Tuning.

To ensure an efficient use of each model, we tune its hyper-parameters with a two-step grid search strategy. We start with a wide grid and take into consideration the non-linearities detected during the first grid search, which allows us to better define the second, more narrowly focused grid.
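The two-step strategy can be sketched as a coarse search followed by a refined search around the best coarse point. The refinement rule (midpoints towards the neighboring coarse values), the hyper-parameter names, and the toy scoring function are all assumptions made for illustration; in practice the score would be a validation metric of the trained model.

```python
import itertools

def grid_search(score_fn, grid):
    """Exhaustive search over the Cartesian product of the grid values."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def refine_grid(grid, best_params):
    """Narrow each axis to the neighbors of the best coarse value."""
    fine = {}
    for name, values in grid.items():
        values = sorted(values)
        i = values.index(best_params[name])
        low, best, high = values[max(i - 1, 0)], values[i], values[min(i + 1, len(values) - 1)]
        fine[name] = sorted({low, (low + best) / 2, best, (best + high) / 2, high})
    return fine

def toy_score(params):
    """Hypothetical stand-in for a validation score (higher is better)."""
    return -((params["learning_rate"] - 0.07) ** 2) - (params["depth"] - 4.0) ** 2

coarse = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "depth": [2.0, 4.0, 8.0]}
coarse_best, coarse_score = grid_search(toy_score, coarse)
fine_best, fine_score = grid_search(toy_score, refine_grid(coarse, coarse_best))
print(coarse_best, fine_best)
```

Because the refined grid always contains the best coarse point, the second search cannot return a worse score than the first.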

Reproducibility and Replicability.

To ensure the reproducibility and replicability of our tests, we take several measures. First, we fix the seed for Python, NumPy, and PyTorch. Second, we force PyTorch to use deterministic cuDNN operations. Furthermore, we perform both training and testing on two different computers and confirm that we obtain identical results. To verify the robustness of each model, we conduct additional training and testing using marginally modified hyper-parameters.
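A seed-fixing helper along these lines is sketched below. The function name and the seed value are assumptions; NumPy and PyTorch are seeded only when they are installed, so the sketch also runs in a plain Python environment.

```python
import random

def set_seed(seed: int) -> None:
    """Fix the random seeds of Python and, when available, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
        # Ask cuDNN for deterministic kernels and disable auto-tuning,
        # which could otherwise select different algorithms between runs.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed: nothing to seed

set_seed(42)  # hypothetical seed value
```

Calling the helper once at the start of each training run is usually sufficient; identical results across machines may additionally depend on library versions and hardware.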

3.2 Performance Analysis

Model Calibration. For each borrower in the test sets and for each year, the models generate a probability of default (PD) between 0% and 100%. To standardize the PDs across the different models, we apply a calibration method based on the BIS IFC conference paper [41]. Specifically, we use the predicted PDs to construct a master scale for each model, which produces a normal distribution of grades while ensuring positive monotonicity of the scale.

Using the estimated PDs and constructed master scale per model, we assign a grade to each borrower for each year and create transition matrices. Our PDs, gradings, and transition matrices are then evaluated against ECB regulatory requirements prior to analyzing their statistical and financial performance.
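The mapping from PDs to grades and the construction of a transition matrix can be sketched as follows. The sketch takes the master scale as given and does not reproduce its calibration; the upper PD bounds per grade and the borrower PDs are hypothetical.

```python
import bisect

def assign_grade(pd_value, upper_bounds):
    """Return the first grade whose upper PD bound is >= pd_value."""
    grade = bisect.bisect_left(upper_bounds, pd_value)
    return min(grade, len(upper_bounds) - 1)

def transition_matrix(grades_start, grades_end, n_grades):
    """Count borrowers moving from grade i (rows) to grade j (columns)."""
    counts = [[0] * n_grades for _ in range(n_grades)]
    for g0, g1 in zip(grades_start, grades_end):
        counts[g0][g1] += 1
    return counts

# Hypothetical master scale with four grades (upper PD bound per grade).
upper_bounds = [0.001, 0.01, 0.05, 1.0]
pds_year_1 = [0.0005, 0.004, 0.02, 0.03]
pds_year_2 = [0.0008, 0.02, 0.04, 0.30]
grades_1 = [assign_grade(p, upper_bounds) for p in pds_year_1]
grades_2 = [assign_grade(p, upper_bounds) for p in pds_year_2]
print(transition_matrix(grades_1, grades_2, len(upper_bounds)))
```

Entries on the main diagonal count borrowers whose grade is unchanged; entries above it are downgrades when higher grade numbers denote higher PDs, as assumed here.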

Regulatory Requirements.

Our analysis adheres to the ECB’s IRB model requirements [42]. We evaluate the applied algorithms for their predictive accuracy and discriminatory power. We examine the stability of the resulting transition matrices. To ensure compliance with ECB standards, we only consider models whose master scales consist of at least 7 grades of not-defaulted credits (ECB required minimum), with a minimum of 0.01% PD distance between two grades.

Predictive Ability:

We use the Jeffreys test [43] to assess the prudence of our PD estimates at both the portfolio and individual grading levels. This involves testing whether the forecasted defaults are greater than the observed defaults, assuming a binomial model with independent observations. In addition to the Jeffreys test, we conduct a binomial test using confidence interval testing to provide further evidence of the prudent levels of the predicted default probabilities.
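The complementary binomial check can be sketched as a one-sided tail probability: under the hypothesis that the PD estimate is prudent, observing at least as many defaults as recorded should not be too unlikely. The sketch below covers only this binomial tail, not the Jeffreys test, which relies on a Beta posterior instead; the observation counts are hypothetical and the PD must lie strictly between 0 and 1.

```python
import math

def binomial_tail_pvalue(defaults, n_obs, pd_estimate):
    """P(X >= defaults) for X ~ Binomial(n_obs, pd_estimate)."""
    if defaults <= 0:
        return 1.0
    log_p = math.log(pd_estimate)
    log_q = math.log1p(-pd_estimate)
    total = 0.0
    for k in range(defaults, n_obs + 1):
        # Log of the binomial probability mass, computed with log-gamma
        # to stay numerically stable for large portfolios.
        log_pmf = (math.lgamma(n_obs + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_obs - k + 1) + k * log_p + (n_obs - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

# Hypothetical grade: 2,000 borrowers, estimated PD of 1%, 31 observed defaults.
p_value = binomial_tail_pvalue(31, 2000, 0.01)
print(p_value, "prudent" if p_value >= 0.05 else "not prudent at the 5% level")
```

A small p-value indicates that the observed defaults significantly exceed the forecast, so the PD estimate for that grade would not be considered prudent.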

Discriminatory Power:

We use the AUROC as a metric to evaluate the discriminatory power of the models. The AUROC measures the ability of the model to distinguish between good and bad credit risks by calculating the area under the receiver operating characteristic (ROC) curve. To ensure that the models accurately separate risky borrowers from less risky ones, we evaluate their rank-ordering performance using the Mann-Whitney U statistic. A higher Mann-Whitney U statistic indicates better rank-ordering performance and, therefore, a more effective model. The U statistic, normalized by the number of pairs of defaulted and non-defaulted borrowers, equals the empirical AUROC and requires no assumption about the shape of the score distributions, which makes it a robust measure of rank ordering on imbalanced datasets.
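The rank-based computation can be sketched as follows: the Mann-Whitney U statistic is obtained from the ranks of the defaulted borrowers, and dividing it by the number of defaulted and non-defaulted pairs gives the empirical AUROC. The function name and the toy scores are assumptions made for the example.

```python
def auroc_mann_whitney(scores, defaulted):
    """Empirical AUROC = U / (n_default * n_non_default), ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_default = sum(defaulted)
    n_non_default = len(defaulted) - n_default
    rank_sum = sum(r for r, d in zip(ranks, defaulted) if d)
    u_statistic = rank_sum - n_default * (n_default + 1) / 2.0
    return u_statistic / (n_default * n_non_default)

# Toy predicted PDs and default flags (1 = default).
print(auroc_mann_whitney([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

A value of 0.5 corresponds to random ranking and a value of 1.0 to a perfect separation of defaulted and non-defaulted borrowers.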

Stability:

The analysis of the grading stability over time includes three different perspectives based on borrowers’ grade transition matrices: (i) borrower migrations through matrix weighted bandwidth (MWB), (ii) grade dispersion through the Herfindahl Index (HI) [44], and (iii) the monotonicity of off-diagonal transition frequencies through pairwise z-tests.

MWBs allow us to summarize the downgrades and upgrades and assess whether the models are potentially biased upwards or downwards in their grading, over time.

The HI allows us to assess the dispersion in grades by benchmarking the concentration level in grades from a current period against grades from a test period. Hypothesis testing is based on the normal approximation, assuming a deterministic HI at current time. The null hypothesis is that the HI of the test year is lower than the current HI, meaning there is no significant increase in the dispersion.
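The concentration measures behind this test can be sketched as below. The sketch computes the plain Herfindahl concentration (sum of squared grade shares) and a coefficient of variation of the shares relative to a uniform distribution; the exact normalization prescribed by the regulator and the hypothesis test itself are not reproduced, and the grade counts are hypothetical.

```python
import math

def grade_concentration(counts):
    """Return (Herfindahl concentration, coefficient of variation) of grade shares."""
    n_grades = len(counts)
    total = sum(counts)
    shares = [c / total for c in counts]
    herfindahl = sum(s * s for s in shares)
    # Deviation of the shares from the uniform share 1 / n_grades.
    cv = math.sqrt(n_grades * sum((s - 1.0 / n_grades) ** 2 for s in shares))
    return herfindahl, cv

# Hypothetical number of borrowers per grade.
herfindahl, cv = grade_concentration([40, 120, 300, 350, 140, 50])
print(herfindahl, cv)
```

With these definitions the two quantities are linked by the identity herfindahl = (cv² + 1) / K for K grades, so either one can serve as the basis of the concentration test.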

To identify possible portfolio shifts, the monotonicity of off-diagonal transition frequencies in the transition matrix is assessed by means of pairwise z-tests, exploiting the asymptotic normality of the test statistic. The null hypothesis of the tests is that a transition frequency \({p}_{i,j}\ge {p}_{i,j-1}\) or \({p}_{i,j-1}\ge {p}_{i,j}\), depending on whether the (i,j) entry in the migration matrix is below or above the main diagonal.

We conduct all tests described above at the 5% significance level to ensure the reliability of our findings [42].

Explainability:

When testing the performance of credit scoring models, explainability is an important consideration. While it is difficult to test explainability directly, we can compare and scrutinize models with varying degrees of interpretability. In our analysis, we focus on comparing the explainability of native GAM models with isotonic GAM models that incorporate expert judgment. Isotonic GAM models provide greater interpretability by allowing an expert to enforce monotonicity constraints on the model’s features, for part of the feature shape function or for the entire function. This can hence help ensure that the model’s predictions are aligned with the expert’s expectations and domain knowledge. Accordingly, we can select a model that strikes an appropriate balance between interpretability and accuracy for a specific use case.

Statistical Performance.

The discriminatory power analysis required by the ECB tests already includes the computation of the AUROC, which we also use as our measure of statistical performance.

Financial Results.

To evaluate the financial performance of the models, we simulate an investment of 100 EUR in each credit proposal with a predicted grade at or below a defined risk appetite threshold. We assume that if a loan is approved, the borrower agrees to pay a coupon consisting of four elements: (i) a one-year risk-free rate (rfr), (ii) a margin (ftp) for funding and liquidity determined by the Fund Transfer Pricing policy (FTP) set by the ALM department and based on market conditions, (iii) a credit risk premium \({cs}_{i,m}\) equal to the PD determined for loan i by each model m, and (iv) a commercial margin (cm) to remunerate the capital and the commercial and back-office departments.

We define a risk appetite based on the maximum acceptable grade, whereby any credit with a grade equal to or lower than the threshold would be approved, and credit with a higher grade would be rejected. We are aware that for a real-life implementation, the risk appetite of the model retained would most likely be based on the expected PD rather than the grade, as this would allow a more granular definition of the risk appetite.

For each approved loan i in a given year, whose PD has been assessed with model m, we compute a financial result:

If the loan defaults, the loss is equal to the principal amount plus the expected coupon, times the LGD: \({result}_{i}=-\left(1+rfr+ftp+{cs}_{i,m}+cm\right)\times LGD\) (or EAD × LGD);

If the loan is repaid at maturity, the result is equal to the sum of the credit spread and the commercial margin: \({result}_{i}={cs}_{i,m}+cm\);

where rfr is the 1-year risk-free rate, ftp is the cost of funding determined by the ALM, \({cs}_{i,m}\) is the credit spread for loan i defined by model m, and cm is the commercial margin defined by the financial institution. We set the LGD (footnote 3) at 45% and assume that the recovery costs are included in the LGD.

This analysis assumes different coupon rates for the same loan since each model might output a different PD. To cope with possible abnormal PD predictions and the competitive environment, we cap the credit spread \({cs}_{i,m}\) at the level of a commonly accepted “reference master scale” (footnote 4). This prevents models outputting very high PDs from artificially benefitting from non-conforming credit spread levels. For example, if a grade 5 credit of a given model assumes an expected PD of 1.65% while the breakpoint of the reference master scale is 1.15%, the credit spread is set at 1.15% instead of the 1.65% generated by the model.
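The per-loan result and the cap on the credit spread can be sketched as follows, per unit of principal. The function names are assumptions; the rate inputs in the usage lines are the base-case values quoted later in this section, and the PD and breakpoint are those of the example in the text.

```python
def capped_spread(model_pd, reference_breakpoint):
    """Credit spread equals the model PD, capped at the reference breakpoint."""
    return min(model_pd, reference_breakpoint)

def loan_result(defaulted, cs, rfr, ftp, cm, lgd):
    """Financial result of one approved loan, per unit of principal."""
    if defaulted:
        # Loss on principal plus expected coupon, scaled by the LGD.
        return -(1.0 + rfr + ftp + cs + cm) * lgd
    # Repaid at maturity: credit spread plus commercial margin.
    return cs + cm

cs = capped_spread(model_pd=0.0165, reference_breakpoint=0.0115)
rates = dict(rfr=0.0325, ftp=0.0075, cm=0.0050, lgd=0.45)
print(100 * loan_result(False, cs, **rates))  # gain in EUR on a 100 EUR loan
print(100 * loan_result(True, cs, **rates))   # loss in EUR on a 100 EUR loan
```

With these inputs, a repaid loan earns about 1.65 EUR per 100 EUR invested and a defaulted loan loses about 47.54 EUR, which shows how a single default offsets the margin earned on many performing loans.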

Finally, we reject models predicting a 100% acceptance rate due to a lack of predictive ability: in such cases we do not compute any return and set it to 0%. We acknowledge that with this dataset, we have one-year credit exposures, so the one-year PD is equal to the lifetime PD. For a similar analysis including credits with longer maturities, we should consider the difference between one-year PDs and lifetime PDs.

We can compute the percentage of accepted loans and the return on invested amount for each model. We perform the base case with rfr = 3.25%, ftp = 0.75%, cm = 0.50%, and the risk appetite threshold set at grade 6 on a scale from 0 to 9, corresponding to the breakpoint of 4.33% in the reference master scale. We then test the sensitivity of the returns to a change in the risk appetite threshold.
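Given per-loan results computed as described above, the acceptance rate and the return on invested amount for a grade threshold can be sketched as follows. The sketch assumes equal amounts invested in every accepted loan, so the return is the average result of the accepted loans, and it sets the return to zero when nothing is rejected; the toy portfolio is hypothetical.

```python
def portfolio_summary(loans, threshold):
    """loans: list of (grade, result per unit invested); returns (acceptance rate, return)."""
    accepted = [result for grade, result in loans if grade <= threshold]
    acceptance_rate = len(accepted) / len(loans)
    if len(accepted) == len(loans) or not accepted:
        # Nothing rejected (no predictive ability) or nothing accepted: return set to 0.
        return acceptance_rate, 0.0
    return acceptance_rate, sum(accepted) / len(accepted)

# Hypothetical loans: (grade on a 0-9 scale, result per unit invested).
loans = [(3, 0.0165), (5, 0.0165), (6, -0.475425), (8, 0.0165)]
for threshold in (5, 6, 7, 8):
    rate, ret = portfolio_summary(loans, threshold)
    print(threshold, f"acceptance={rate:.0%}", f"return={ret:.2%}")
```

Looping over thresholds in this way corresponds to the sensitivity analysis of the return to the risk appetite.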

4 Empirical Results

4.1 Regulatory Requirements

The results of our empirical analysis indicate that three models are unable to produce a master scale with sufficient grades at a minimum distance of 0.01% PD. Specifically, the random forest model results in a master scale with only four grades, while the naive Bayes and support vector machine models produce scales with only five grades. Given that the ECB requires a minimum of seven grades plus an additional grade for defaulted borrowers, these three models are automatically excluded from further analysis. Consequently, we are able to analyze ten models that successfully meet this criterion.

Predictive Ability:

Table 1 shows the number of observations and defaults in our test data.

Table 1. Number of observations and defaults in the test data, per year.

At portfolio level, we find that all the investigated models produce prudent PD estimates for all test years, based on the Jeffreys test results. At individual grade level, all models pass the Jeffreys test for a minimum of 7 grades and a maximum of all (9) grades, for all years. ELN, EBM, EBM Isotonic, XGB and GB are found to have prudent PD estimates for all grades, over all test years. LGBM, GAMI and GAMI Isotonic follow closely, with prudent PD estimates for at least 8 grades. The binomial test consistently confirms the prudence of the PD estimates found through the Jeffreys test. LDA and LR produce the least prudent PD estimates based on the Jeffreys test, with only 7 grades that have prudent estimates for two consecutive years.

Discriminatory Power:

Figure 1 shows the AUROC values for the 10 models analyzed. AUROC is used to assess the discriminatory power and is the most common statistical metric for assessing credit scoring models’ performance.

Fig. 1. AUROC per model per year.

XGB achieves the highest AUROC across all three years, closely followed by EBM, which also exhibits the best performance among the inherently explainable models. EBM Isotonic and GAMI Isotonic have statistical results similar to those of native EBM and GAMI respectively, suggesting that enforcing isotonicity does not affect the statistical performance.

Although we notice a downward trend in the AUROC values over time for all models, we do not have enough evidence to assess the time consistency of the discriminatory power, because of the limited test sample.

Stability:

The over-time stability analysis indicates that all models predict meaningful grades in terms of stability, over all periods. The coefficients of variation for the 2020–2021 and 2021–2022 periods, used to calculate the HI, lie in [1.27, 1.34] and [1.33, 1.43], respectively. As a result, all the HI tests pass and the upper and lower MWBs are relatively close to each other, meaning that grade dispersion is meaningful and none of the models has transition matrices biased towards downgrading or upgrading.

Significant grade transitions during the one-year observation periods affect the stability of the scoring model. Nevertheless, it must be considered that (i) the investigated period is short and (ii) that both observation periods fall within the COVID-19 pandemic period. It follows that significant downgrades and upgrades are to be expected and might not directly relate to the stability of the model but rather to its capacity to adequately adapt to changes in underlying features.

A model’s stability might be questioned when both the number of significant grade transitions and the size of the transitions are material compared to the actual number of transitions. Moreover, this stability analysis enables the model user to further investigate those borrowers for which material grade transitions are identified. We have identified a maximum of 3 significant shifts per period, per model. The highest jumps result from LDA and MLP, with jump sizes of 4.7 and 4.6 respectively. All other significant jump sizes are between 1 and 3. Due to restricted information on the data, we could not perform any further analysis.

Explainability:

Explainable GAMs provide the importance and impact of the features. They also provide the detailed shape of \({f}_{j}\). Figure 2 shows the feature impact plots of EBM and EBM isotonic for the bs_012 feature (amount of retained earnings, which has a negative impact on PD). These plots allow us to understand the main impact of the model’s features on the estimated PD, given specific feature values. The figure represents a balance sheet feature whose magnitude is inversely related to the PD. The higher the feature value, the lower the PD estimate should be under common financial understanding. This relation is enforced in EBM isotonic, which enhances the interpretability of the feature impact plot.

Fig. 2. Feature impact plots resulting from EBM (left) and EBM isotonic (right).

The standard deviation of PD estimates for each region of the feature’s space is used to compute error bars on top of the main effects. These are rough estimates of the model uncertainty in each region of the feature space, which are determined by the amount of training data in each region. The larger the error, the higher the uncertainty and hence the more unstable the model predictions, which lowers the interpretability potential.

4.2 Statistical Analysis

For the statistical analysis results, we refer to the AUROC metric from the discriminatory power analysis performed under the regulatory requirements.

4.3 Financial Analysis

We evaluate the performance of the credit scoring models based on their return and acceptance rate for two risk appetite scenarios with different credit grade thresholds. The first scenario is a low-risk appetite with a threshold at grade 6 whereby all credit exposures with grade 6 or lower are accepted, and those with a grade of 7 to 9 are rejected. The second one is a high-risk appetite with the threshold at grade 7. We also compute a baseline return, for which we assume no model, with all credits accepted and an average credit spread corresponding to grade 5 (above the median of all accepted credit with each model).

Financial Performance:

The results for 2020 and 2021 are presented in Tables 2 and 3, respectively, with models grouped by their degree of explainability. The 2022 results are very similar and add no significant element (footnote 5). The best models per category in terms of generated return are highlighted in bold. The “no model” row presents the return if all credits are accepted with a credit spread equivalent to grade 5 of the reference master scale.

MLP is consistently the best performer among all models, while EBM and GAMI are the best performers among the intrinsically explainable models.

Table 2. Credit acceptance rate and return of credit scoring models for 2020.
Table 3. Credit acceptance rate and return of credit scoring models for 2021.

Adding expert judgment through the isotonicity constraint does not alter the financial performance of isotonic models.

Sensitivity Analysis:

To evaluate the return sensitivity to credit risk appetite, we analyze the rejection rate as determined by the grade threshold. As we impose a Gaussian grade distribution, the rejection rate follows a quasi-cumulative normal shape, as shown in Table 4.

Table 4. Average rejection rate per risk appetite threshold.

As we lower the threshold to reduce the risk appetite, the rejection rate becomes very high and the differences between models vanish.
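The link between a Gaussian grade distribution and the shape of the rejection rate can be illustrated as follows. This is a minimal sketch, assuming nine grades obtained by cutting a normal distribution at half-integer boundaries and truncating it to the grade range; the mean `mu` and standard deviation `sigma` are hypothetical and do not come from the paper's master scale.

```python
from statistics import NormalDist


def grade_shares(n_grades=9, mu=5.0, sigma=1.5):
    """Share of exposures per grade under a truncated, discretized normal.

    mu and sigma are illustrative assumptions, not calibrated values.
    """
    dist = NormalDist(mu, sigma)
    lower, upper = 0.5, n_grades + 0.5
    mass = dist.cdf(upper) - dist.cdf(lower)  # probability inside the grade range
    return [
        (dist.cdf(g + 0.5) - dist.cdf(g - 0.5)) / mass
        for g in range(1, n_grades + 1)
    ]


def rejection_rate(threshold, shares):
    """Share of exposures with a grade strictly above the threshold."""
    return sum(shares[threshold:])  # shares[0] is grade 1


shares = grade_shares()
for threshold in range(1, 10):
    print(threshold, round(rejection_rate(threshold, shares), 3))
```

Because the rejection rate is one minus the cumulative share of grades up to the threshold, it traces a mirrored cumulative normal curve: close to one for low thresholds, falling steeply around the central grades, and reaching zero at the last grade.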

Fig. 3. Return per model as a function of the lender's risk appetite, per grade threshold.

The difference in financial performance between models widens as the risk appetite increases. Notably, the spreads between models become more visible above grade 5. Figure 3, which presents the return for each model with thresholds between 5 and 8, shows that GAMI is the best performer with a threshold at 5, but MLP becomes the best performer from grade 6 upward, followed by EBM and GAMI. LDA does not reject any credit at the grade 8 threshold, so, following our methodology, we reject the model and set its return to zero. These results are consistent across all three years analyzed.

Statistical Versus Financial Performance:

The statistical performance of all models deteriorates rapidly (around -5% over two years), which could support re-training the models on a yearly basis. Nevertheless, the resilience of the financial performance tends to indicate that the models are stable and could be re-trained less frequently.

When we compare the AUROC with the returns of each model for each year in Table 5, we find no evidence of a positive correlation between statistical and financial performance. Explaining this lack of positive correlation would require an in-depth analysis that is beyond the scope of this paper.

Table 5. AUROC and return per year – Grade 6.

These results suggest that statistical tests may not be reliable in selecting the most profitable model, a conclusion previously drawn by [33] for AI models predicting asset returns. This highlights the importance of assessing financial performance to consider the practical implications of using a particular model in a real-world setting.
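One simple way to check whether models that rank well on AUROC also rank well on return is a rank correlation across models. The sketch below uses Spearman's coefficient as an illustrative choice; the paper does not state which correlation measure was used, so this is an assumption, and the function names are hypothetical.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(auroc_per_model, return_per_model):
    """Spearman rank correlation between AUROC and return across models."""
    return pearson(ranks(auroc_per_model), ranks(return_per_model))
```

A coefficient near +1 would mean that the statistical ranking of the models predicts their financial ranking; values near zero or below are in line with the absence of positive correlation reported above. With only a handful of models per year, such a coefficient is a rough indication rather than a significance test.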

4.4 Cost of Explainability

We compare the financial return generated by (i) the best financially performing model (“best model”) with (ii) the best inherently explainable model (“best XAI model”). We find that there is a cost associated with explainability. With GAM models, the cost ranges from 14 to 21 bp (Table 6), depending on the risk appetite and time horizon. When we allow human supervision through the isotonicity constraint, we increase the explainability of GAM models at no financial cost. The cost drops to 0 if we exclude neural networks from the panel and compare EBM and GAMI solely with decision trees.

The cost of explainability becomes more significant when we limit ourselves to historically predominant models such as LR and exclude EBM and GAMI. In such a case, the cost can increase by around 25 bp to 30 bp, compared to MLP performance.

Table 6. Cost of explainability per year and per risk appetite (without Isotonic GAM).

The importance of the cost of explainability should be assessed in light of the lender's specific situation and the purpose of the model: a commercial model for underwriting and pricing, a risk management model, or an internal model for capital requirement measurement.
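The comparison made in this section reduces to a difference between two maxima, which the following sketch makes explicit. It assumes returns are stored as decimal fractions per model (so that 0.0001 is one basis point); the function name and the `exclude` parameter are hypothetical conveniences, not part of the paper's methodology.

```python
def cost_of_explainability_bp(returns, explainable, exclude=()):
    """Return gap, in basis points, between the best model and the best XAI model.

    returns:     mapping of model name to return (decimal fraction, assumption)
    explainable: names of the inherently explainable models
    exclude:     models removed from the panel before comparing
    """
    panel = {m: r for m, r in returns.items() if m not in exclude}
    xai_returns = [r for m, r in panel.items() if m in explainable]
    if not xai_returns:
        raise ValueError("no explainable model left in the panel")
    best_model = max(panel.values())
    best_xai_model = max(xai_returns)
    return (best_model - best_xai_model) * 1e4
```

Passing the neural network in `exclude` mimics the restricted panel discussed above: whenever an explainable model is the best remaining performer, the two maxima coincide and the cost is zero. Restricting `explainable` to traditional models such as LR gives the wider gap relative to MLP.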

5 Conclusion and Perspectives

5.1 Conclusion

In this study, we compare machine learning models for credit scoring that meet the ECB minimum requirements. We evaluate their statistical and financial performance on a specific dataset. Our analysis reveals that traditional models like LR, ELN, and LDA perform poorly in both statistical and financial terms. MLP is the best financial performer but among the worst statistical ones, while LGBM is a good statistical model with poor financial performance. This highlights the importance of computing the financial performance. We also examine the performance of newly developed explainable models, such as EBM and GAMI, which rely on black-box algorithms to derive GAM outputs. These models demonstrate excellent statistical and financial performance, with the ability to compete with and even outperform decision tree models like XGB in financial terms.

We compute the cost of explainability of these inherently explainable models relative to the best-performing black-box model, MLP, and find a cost of 0.14% to 0.21% (14 to 21 bp), depending on the risk appetite.

Our study further investigates a way to improve explainability by introducing isotonicity as human supervision. By introducing expert judgment to obtain isotonic GAMs, we improve the explainability of the GAM without compromising the financial or statistical performance. If confirmed with other datasets, these findings could further enhance the attractiveness of GAM models and contribute to their acceptability from a regulatory point of view.

In summary, our findings demonstrate that explainable models, such as EBM and GAMI, are promising alternatives to traditional models, and that isotonicity, as an expression of expert judgment, can be used to enhance their interpretability while preserving their financial and statistical performance.

5.2 Perspectives

While the proposed approach shows promising results, it has only been tested on a single dataset over a limited period, with a low granularity in the risk appetite threshold. Further analysis with different datasets, longer periods, and more granular risk appetite thresholds would help confirm or challenge these initial findings. Longer periods would allow the use of more complex neural network architectures and, hence, possibly lead to a higher cost of explainability. Additionally, exploring the use of explainable models for feature selection could be a valuable avenue for future research.

It is also worth noting that the study highlights inconsistencies between statistical and economic results that warrant further investigation. As academic research mostly relies on statistical measures, a better understanding of whether and how these measures reflect the economic performance of a model could enhance the application and adoption of academic research in the financial world.