1 Background

Artificial intelligence methods, based on machine learning applied to the data, are rapidly changing financial services, in all areas, such as lending, asset management and payment services, transforming “finance” into “financial technologies”.

While financial technologies, and peer-to-peer lending in particular, improve user experience and possibly lower costs, they may increase risks. Among them is the risk of inaccurate estimates in credit scoring, i.e. improper measures of the creditworthiness of the borrowers (“model risk”). The occurrence of this risk may lead to important credit losses, especially when credit is given to large companies. Indeed, incentives are rather different: while in classic bank lending the costs of wrong credit rating assessments are paid by the banks themselves, in peer-to-peer lending they are paid by the investors.

These considerations suggest, given the increased economic importance of platform lending, that regulators and supervisors should carefully supervise the model risks that arise from credit ratings and their use by lending platforms.

A first important model risk concerns Sustainability, and it arises when the model is not resilient to cyber attacks or to extreme data and, in particular, when it is affected by “external” factors, represented by Environmental, Social and Governance (ESG) factors. The problem is quite challenging. First of all, it is not clear whether ESG factors do impact credit ratings, particularly as they refer to a long-term time horizon, differently from credit ratings. The most important problem is, however, the lack of standardisation of ESG scores. ESG scores are currently made available by various specialised companies, including rating agencies. The presence of ESG scores in the market can push companies to improve their Corporate Social Performance (CSP) or ESG behaviour [1], but it also presents possible drawbacks. Multiple ESG ratings for a given company can differ and create opaqueness about the company’s actual ESG standing or greenwashing misbehaviour. A recent survey by KPMG [2] showed the existence of more than 160 ESG ratings and data providers, with multiple agencies (e.g. Bloomberg, Thomson Reuters, S&P) whose ESG ratings may however differ. The study in [3] showed little convergence between different ESG ratings. More recently, Abhayawansa and Tyagi [4] provided evidence of the low correlation between ESG ratings issued by different providers. The lack of standardisation of ESG metrics is a problem for both investors and borrowers. From the investors’ point of view, it could be challenging to understand and choose among the ESG ratings to select the best investment opportunities. Similarly, it would be difficult for borrower companies to establish financing plans in a correct way.

We believe that taking ESG factors into account is a necessary step towards sustainable finance and, for this reason, we consider the issue of Sustainable credit scores, investigated through the impact of ESG scores on credit ratings, as the main focus of our paper.

A second important model risk concerns lack of predictive accuracy. Credit scoring in peer-to-peer lending has been studied in a few recent papers that propose network models to take into account the platform risk arising from the connectivity between companies. In these papers, financial network models improve the predictive accuracy of the individual probability of default by considering similarities or linkages among borrowers. This becomes crucial for peer-to-peer lending platforms, in which individuals are able to directly provide small and, in most cases, unsecured loans to small and medium enterprises, without the availability of the financial and behavioural information typically leveraged by banks. A network-based scoring model built upon balance sheet similarities between P2P borrowing companies was applied by Agosto et al. [5], while Ahelegbey et al. [6] improved P2P credit scoring models by clustering SMEs based on latent risk factors, deduced from financial ratios. In [5], a network is instead built upon trade flows between the companies joining the platform, proxied by input–output data at the sector level. While network models, and similarly complex machine learning models, may seem appealing, as they capture nonlinearities and thereby improve predictive accuracy, in some cases they can be limited by their “black-box” nature, which makes it difficult to interpret the results. Although complex machine learning models may reach high predictive accuracy, their predictions are not Explainable, in the sense that they cannot be understood, and therefore overseen, by humans.

We believe that such models may be useful when they improve model accuracy in a manner that more than compensates for their lack of explainability, so that the additional computational burden of making them explainable is worthwhile. This may not be the case when data are of limited quality.

Indeed, following what we already discussed, a third important model risk that may arise in machine learning credit scoring is that of data quality, whose lack may lead to unfair results, as stated, for example, in the recent European Artificial Intelligence Act [7]. The problem of data quality arises in credit scoring when some necessary information is missing or contradictory. This is the case of sustainability factors, encoded in ESG measures: they are not yet standardised, with different data providers assigning a different ESG value to the same company, and with a relatively short time series available. This lack of standardisation may lead to unfair credit ratings, which creates a distorted credit allocation.

We believe that lack of data quality is a real concern that prevents a correct understanding of the impact of ESG factors on credit ratings. However, in line with our focus, we will employ the data available so far, trying to leverage not only the disadvantages but also the advantages of inconsistent ESG databases.

A fourth important model risk is lack of explainability of the credit scores. This is a very relevant problem for many stakeholders: for investors, who cannot rationalise their investment decisions, not knowing why some companies have a higher score than others; for borrowers, who cannot improve their scores without knowing the drivers of their values; for regulators and supervisors, who cannot evaluate the impacts of the proposed models, particularly under stress scenarios, and therefore may not validate them. Complex machine learning models may be highly accurate, as they can capture nonlinearities and interdependences, but are typically “black-box”: they assign predictive scores without explaining their determinants in terms of the most correlated explanatory variables, as “classic” regression models do, leading to a lack of model explainability. The recent machine learning literature has proposed methods to explain black-box models by means of further processing of the predictive output: see e.g. [8,9,10,11].

We believe that Explainable AI methods are useful, but their extra computational burden is not justified when the available data are of limited quality and/or size. In this case it would be better to build a model that, while complex and capable of capturing nonlinearities, is “explainable by design”, like a simple regression model.

Sustainability, Accuracy, Fairness and Explainability are desirable characteristics of a machine learning model, which should be monitored along time for high-risk applications of AI, as stated in the recent European AI Act (and in similar proposals to regulate artificial intelligence that are being developed worldwide). The importance of these four characteristics highlights the need to build appropriate statistical metrics to measure them, currently not available.

To fill the gap, we have been working in close collaboration between academics and policymakers, within the Milano Hub of the Bank of Italy, to develop a S.A.F.E. learning model that can take Environmental, Social and Governance factors into account.

The result of the collaboration, reported in the present paper, is a credit scoring model for companies that, given the available data, is: Sustainable, as it contributes to sustainability efforts in Finance, by taking into account ESG indicators in the prediction of creditworthiness; Accurate, as it indicates that ESG factors predict to some extent credit ratings, even when controlling for balance sheet information; Fair, as it “compensates” different data providers into one combined score; Explainable, as based on a mixture model whose weights indicate the importance of each ESG score in determining the credit scores.

From a methodological viewpoint, the main contribution of the paper is a data-driven model that describes how ESG scores affect credit ratings, by means of a statistical learning model that is explainable by design, as the final ESG score is a linear combination of the ESG sources, with weights that are proportional to their predictive accuracies.

We remark that the aim of the paper is not to evaluate what the effect of ESG indicators on credit ratings is but, rather, whether there is such an effect, and whether different ESG indicators (or their combined score) contribute differently to this effect (even if potentially limited). This is why we focus on the Bayesian model, which produces a weight for each indicator that depends on its accuracy, making it possible to judge the accuracy of each ESG score for credit ratings and, furthermore, providing a way to aggregate the indicators into a combined measure that we show to improve accuracy. The weights depend on the in-sample accuracy of each ESG indicator in explaining a target variable related to the company’s creditworthiness, such as the credit rating or a binary default variable. In other words, the combined ESG score will be strongly impacted by the more accurate scores and less impacted by the less accurate ones.

The methodology proposed in this paper can be useful to different stakeholders. For data scientists, it provides assessment metrics for different ESG indicators, which are proportional to their (credit rating) predictive accuracy. For investors, it can provide an aggregate ESG indicator, more robust than single indicators, that can be used in investment decisions. For borrowers, it provides a means to evaluate long-term lending perspectives, taking ESG factors into account.

To our knowledge, this is the first data-driven model based on the relationship between credit ratings and ESG scores, by means of a statistical learning model that is explainable by design, as the final ESG score is a linear combination of the ESG sources, with weights that are proportional to their predictive accuracies.

The remainder of this paper is organised as follows: Sect. 2 presents a discussion on the main focus of our paper: the relationship between ESG factors and credit ratings; Sect. 3 introduces the proposed modelling approach; Sect. 4 presents an application of the methodology to a sample of European companies and, finally, Sect. 5 concludes.

2 ESG scores and credit rating

Corporate Social Performance (CSP) is aimed at evaluating the degree to which companies are sustainable, that is, how they perform their business activities in relation to the external stakeholders and taking into account the economic, environmental, social, and time factors [12, 13]. Environmental, Social and Governance (ESG) factors are often taken as a proxy for the sustainable behaviour of companies.

Environmental factors (E) relate to the impact on the environment deriving from the production of goods or services and include carbon emissions, preservation of the natural environment, biodiversity protection, and waste and water management [14,15,16]. A company that operates with less harm to the environment might reduce the probability of future scandals, legal actions, losses related to legal claims etc. and benefit from a better reputation and lower risks [17].

Social factors (S) refer to the impacts of companies on society, including issues of employee satisfaction, diversity, inequality, gender gap, protection of young people and children, investment in human capital and communities, and human rights [14, 18].

Governance factors (G) measure the quality of corporate governance. Shortcomings in governance have been in the past the cause of major scandals and crises, such as the Enron crisis in the USA, Volkswagen in Germany, Parmalat in Italy, and the banking crisis of 2007–2008 [19, 20]. Improved governance settings can contribute to a more sustainable and balanced growth of firms, therefore contributing to a more sustainable economic development [21, 22].

The above factors are the basis for investment decisions and drive the choice of investors in terms of which companies to finance through equity or debt. To improve the interpretability of ESG, specialised companies (including rating agencies) have started to provide measures and proxies for ESG behaviour, publishing ESG ratings or ESG scores that convey the level of sustainability of companies and the degree of accountability of these companies on ESG aspects [23, 24].

Each rating provider collects information from different sources (company reports, news, stock exchange information, etc.) and applies proprietary methodologies to combine information and produce a summary measure of ESG behaviour. Different methodologies yield different measures, that often produce divergent results [3, 4, 25, 26], and this induces lack of standardisation.

The importance of ESG metrics is bound to grow in the future, with ESG ratings likely to affect investors’ decisions, firms’ ability to finance their investments and pursue a sustainable business model. It follows that understanding whether and how ESG ratings affect creditworthiness is a very important managerial and policy challenge.

To our knowledge, this is the first work to: (1) analyse the relationship between ESG scores and credit ratings through a data-driven model that predicts the company’s credit rating class based on the ESG rating; (2) use the ESG scores assigned by different providers to create a combined metric where each ESG score is weighted based on its predictive accuracy.

In the next section, we describe our proposed methodology, which is applied to real data in Sect. 4. Section 5 concludes the paper with a final discussion.

3 Proposal

In this section, we introduce our proposed Bayesian learning model, which leads to an indicator for the ESG performance of listed companies that integrates the ESG scores assigned by different providers. The indicator is obtained attributing to each available ESG score a weight that is a function of the likelihood of the observed counts of companies belonging to the different credit rating classes, under the alternative partitions generated by the ESG scores. The likelihood weights express in-sample predictive performance and are obtained through the application of Bayes’ theorem.

Our model is based on the assumption that there is an effect of ESG scores on credit ratings. However, our aim is not to build a model that employs ESG scores to improve credit rating predictive accuracy but, rather, to investigate the relative importance of each ESG data score. To this end, we extend to the ESG context the methodology proposed by Cerchiello and Giudici [27], who considered the case of estimating a company’s probability of default using a set of explanatory financial variables. Our proposal relies on the modelling approach by Cerchiello and Giudici [27], but applies it to study the relationship between ESG indicators and credit risk, extending what was proposed by Agosto et al. [28] to multiple ESG scores and to a binomial response variable.

In [27], based on the mixture of Dirichlet processes model proposed by Giudici et al. [29], it is assumed that the partition \(g_k\) generated by the k-th among K covariates is made up of \(j=1,...,J_k\) levels and that the probability of default of company i, \(Prob(Y_i=1)\) (where \(Y_i\) is a binary variable equal to 1 if company i defaults, 0 otherwise), is constant within the same level j of the covariate and equal to \(\theta _j\).

Here, we extend their work assuming that the partition \(g_k\) is generated by the values of the ESG scores assigned by the k-th data provider, and that \(Y_i\) is a binary variable which indicates whether a company rating is speculative (equal to 1) or investment grade (equal to 0). These assumptions do not imply a loss of generality: different partitions can be assumed, for example, corresponding to a combination of ESG scores, and a different binarisation of the rating can be considered to obtain Y.

Letting \(Y_i\) be a Bernoulli(\(\theta _j\)) variable and the \(\theta _j\)’s be Beta random variables with parameters \(\alpha \) and \(\beta \), which implies that, a priori, \(E(\theta _j)=\frac{\alpha }{\alpha +\beta }\), the marginal likelihood contribution of level j can be obtained as:

$$\begin{aligned} p(y\mid j) &= \int _{0}^{1}p(y\mid \theta _j)\, p(\theta _j)\, d\theta _j\\ &= \int _{0}^{1} \theta _j^{d_j} (1-\theta _j)^{n_j-d_j} \frac{1}{B(\alpha ,\beta )} \theta _j^{\alpha -1} (1-\theta _j)^{\beta -1}\, d\theta _j\\ &= \dfrac{\Gamma (\alpha +\beta )}{\Gamma (\alpha ) \Gamma (\beta )}\dfrac{\Gamma (\alpha +d_j) \Gamma (\beta +n_j-d_j)}{\Gamma (\alpha +\beta +n_j)} \end{aligned}$$
(1)

where \(p(\theta _j)\) is the prior distribution of \(\theta _j\), \(d_j\) is the number of defaulted companies and \(n_j\) is the total number of companies sharing level j of the k-th covariate. Furthermore, B is the Beta function, defined by:

$$\begin{aligned} B(z_1,z_2) = \frac{\Gamma (z_1)\Gamma (z_2)}{\Gamma (z_1+z_2)}, \end{aligned}$$

where for each positive integer n:

$$\begin{aligned} \Gamma (n) = (n-1)! \end{aligned}$$
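
As an illustration, Eq. (1) can be evaluated numerically on the log scale, which avoids overflow of the Gamma function for large counts. The following is a minimal Python sketch, not the implementation used in the paper: the function name and the example counts are hypothetical, and the log-Gamma function is used in place of the factorial form so that non-integer values of \(\alpha \) and \(\beta \) are also handled.

```python
from math import exp, lgamma


def log_marginal_level(d_j: int, n_j: int, alpha: float, beta: float) -> float:
    """Logarithm of Eq. (1): Beta-Bernoulli marginal likelihood of one level j.

    d_j   -- number of companies with Y = 1 in level j
    n_j   -- total number of companies in level j
    alpha, beta -- parameters of the Beta prior on theta_j
    """
    return (
        lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
        + lgamma(alpha + d_j) + lgamma(beta + n_j - d_j)
        - lgamma(alpha + beta + n_j)
    )


# Hypothetical level: 3 companies with Y = 1 out of 10, uniform Beta(1, 1) prior.
print(exp(log_marginal_level(3, 10, 1.0, 1.0)))
```

With a Beta(1, 1) prior the expression reduces to \(B(1+d_j, 1+n_j-d_j)\), which for the hypothetical counts above equals 1/1320.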

Under the assumption that the \(\theta _j\)’s are independent random variables, the marginal likelihood of the partition \(g_k\) is:

$$\begin{aligned} p(y\mid g_k)=\prod _{j=1}^{J_k}p(y\mid j), \end{aligned}$$
(2)

which determines the posterior probability of the partition:

$$\begin{aligned} p(g_k\mid Y) \propto p(y\mid g_k)\,p(g_k), \end{aligned}$$
(3)

where \(p(g_k)\) can be set a priori, for example, according to the uniform distribution: \(p(g_k)\propto 1/M\) where M is a constant.
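
Equations (2) and (3) can be sketched in a few lines of Python. The example below is only illustrative: it assumes a uniform prior over the partitions, takes as input the per-level counts \((d_j, n_j)\) of each partition, and uses hypothetical provider names and counts. The normalisation is carried out on the log scale for numerical stability.

```python
from math import exp, lgamma


def log_marginal_level(d_j, n_j, alpha, beta):
    """Logarithm of the level-j marginal likelihood, as in Eq. (1)."""
    return (
        lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
        + lgamma(alpha + d_j) + lgamma(beta + n_j - d_j)
        - lgamma(alpha + beta + n_j)
    )


def partition_weights(counts_by_partition, alpha, beta):
    """Posterior probabilities p(g_k | Y) under a uniform prior p(g_k).

    counts_by_partition maps a partition name to a list of (d_j, n_j) pairs.
    """
    # Eq. (2): the log marginal likelihood of a partition is the sum over its levels.
    log_ml = {
        k: sum(log_marginal_level(d, n, alpha, beta) for d, n in levels)
        for k, levels in counts_by_partition.items()
    }
    # Eq. (3): normalise; subtracting the maximum avoids numerical underflow.
    top = max(log_ml.values())
    unnormalised = {k: exp(v - top) for k, v in log_ml.items()}
    total = sum(unnormalised.values())
    return {k: u / total for k, u in unnormalised.items()}


# Hypothetical counts: provider_A separates the two classes, provider_B does not.
counts = {
    "provider_A": [(1, 10), (9, 10)],
    "provider_B": [(5, 10), (5, 10)],
}
weights = partition_weights(counts, alpha=0.5, beta=0.5)
print(weights)
```

In this toy case the partition that separates the two classes more sharply receives the larger posterior weight.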

The expected probability of default of company i, conditional on the available set of covariates X, can then be obtained as follows:

$$\begin{aligned} E(\theta _i\mid X,Y)=\sum _{k=1}^{K}E(\theta _j\mid g_k,Y)\,p(g_k\mid Y), \end{aligned}$$
(4)

with \(E(\theta _j\mid g_k,Y)=\dfrac{\alpha +d_j}{\alpha +\beta +n_j}\), where j denotes the level of partition \(g_k\) to which company i belongs, and in which the posterior probability \(p(g_k\mid Y)\) acts as the weight of the k-th covariate in determining the expected probability of the default event.
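
To make Eq. (4) concrete, the following minimal Python sketch computes the merged probability for a single company. It is an illustration under stated assumptions: the partition weights are taken as given (here hypothetical values), each company is assumed to be assigned to exactly one level of each partition, and all names and counts are invented for the example.

```python
def posterior_mean(d_j, n_j, alpha, beta):
    """Posterior mean of theta_j for a level with d_j events out of n_j companies."""
    return (alpha + d_j) / (alpha + beta + n_j)


def merged_probability(company_levels, counts_by_partition, weights, alpha, beta):
    """Eq. (4): weighted mean of the level-specific posterior means.

    company_levels      -- maps partition name to the level index j of the company
    counts_by_partition -- maps partition name to a list of (d_j, n_j) pairs
    weights             -- maps partition name to p(g_k | Y)
    """
    total = 0.0
    for k, j in company_levels.items():
        d_j, n_j = counts_by_partition[k][j]
        total += weights[k] * posterior_mean(d_j, n_j, alpha, beta)
    return total


# Hypothetical inputs: two partitions with two levels each.
counts = {
    "provider_A": [(1, 10), (9, 10)],
    "provider_B": [(5, 10), (5, 10)],
}
weights = {"provider_A": 0.7, "provider_B": 0.3}  # assumed posterior weights
company = {"provider_A": 0, "provider_B": 1}       # level of the company per partition

p = merged_probability(company, counts, weights, alpha=0.5, beta=0.5)
print(round(p, 4))
```

Since the weights sum to one and each posterior mean lies in the unit interval, the merged value is itself a probability.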

Equations (3) and (4) summarise the essence of our proposed machine learning model. It is a Sustainable model, as it makes it possible to measure the impact of ESG factors on credit ratings; it is a Fair model, as it averages the contribution of different ESG providers, compensating their differences, due to different objectives; it is an Explainable model, as it is a linear combination of posterior expectations, with weights given by posterior probabilities which, although calculated in a nonlinear way, have a clearly interpretable meaning. In the next section, we will verify, for our available data, whether the model is also accurate, that is, whether ESG factors have a predictive relevance for credit ratings, and what the relative weights of each ESG factor in the model are.

For the sake of comparison and completeness, we consider XGBOOST [30] as a benchmark model, for its well-known capability of modelling nonlinearity in a very efficient way, without imposing any distributional assumption. Moreover, together with deep neural networks, it represents the state of the art as far as overall accuracy is concerned. Since deep neural networks cannot be profitably employed in the current exercise, given the dimensions of the dataset, we resort to XGBOOST. Indeed, the latter is an ensemble model which builds on the idea of combining several weak classifiers to create a strong one, characterised by extremely good performance thanks to a regularised gradient boosting framework. As a further term of comparison, we also consider bagging and random forest models, which belong to the same family of ensemble approaches but exploit different strategies [31].

4 Application

4.1 Data

In this section, we apply our proposed methodology to a sample of 1382 European companies for which we retrieve:

  • the MSCI ESG Score: a continuous variable ranging from 0 (lowest sustainability) to 10 (highest sustainability);

  • the Refinitiv ESG Score: a continuous variable ranging from 0 to 100. As for the MSCI ESG score, higher values indicate better sustainability profiles;

  • the Standard and Poor’s (S&P) Global ESG Rank: a discrete variable defined as the total sustainability percentile rank, ranging from 0 (lowest sustainability) to 100 (highest sustainability);

  • the risk class assigned to the company based on the probability of default over the next year generated by the Bloomberg Issuer Default Risk model: an ordinal variable whose categories in the sample range from IG1 (highest creditworthiness) to D4 (lowest creditworthiness). Specifically, classes from IG1 to IG10 identify Investment Grade bond issuers, while classes from HY1 to HY6 and from D1 to D4 identify High Yield and Distressed bond issuers, respectively. Starting from the rating class information, we define a binary variable which is equal to 1 if the company belongs to a speculative (high-yield or distressed) class, 0 otherwise. This will be our target variable in the application of the Bayesian model presented in Sect. 3;

  • a set of 13 financial ratios which should reflect company profitability, growth and liquidity, together with the value of market capitalisation, which serves as a dimensional indicator.

To allow the comparability of the scores, the MSCI ESG score has been rescaled in the 0–100 range.
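
The two data preparation steps described above, rescaling the MSCI score and binarising the rating class, can be sketched as follows. This is a minimal Python illustration: the function names are hypothetical, and the assumption that investment grade labels start with the prefix "IG" is made only for the example.

```python
def rescale_msci(score_0_10: float) -> float:
    """Map an MSCI ESG score from the 0-10 range to the 0-100 range."""
    return score_0_10 * 10.0


def is_speculative(rating_class: str) -> int:
    """Return 1 for High Yield or Distressed classes, 0 for Investment Grade.

    Assumes (for illustration only) that Investment Grade labels start with "IG".
    """
    return 0 if rating_class.strip().upper().startswith("IG") else 1


print(rescale_msci(7.3), is_speculative("IG10"), is_speculative("D4"))
```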

Data are the latest available as of August 3, 2022, and are retrieved from various sources: MSCI ESG Research (for the MSCI ESG scores), Refinitiv (LSEG business) (for the Refinitiv ESG scores), and Bloomberg (for the S&P Global ESG rank and the credit ratings). All data are pre-processed so that no missing values are present in our sample. In our setting, among the European companies having an ESG rating, we only select those (1382) for which all three ESG scores are available at the considered date. The data have a cross-sectional structure, as they all refer to a single date, August 3, 2022.

The distribution of sample companies among the credit rating classes is shown in Fig. 1.

Fig. 1 Distribution of the analysed companies by Bloomberg credit rating. Source: own elaborations based on Bloomberg data

Figure 1 shows that, for the considered companies, the distribution of ratings is quite skewed to the right, and that there is a large group of companies with very high ratings (IG1). Both aspects will make it more challenging to attain a good level of predictive accuracy.

As can be seen from Fig. 2, the distribution of the three ESG scores in the analysed sample is instead left-skewed, meaning that a small number of companies have a much worse ESG evaluation than the mean one.

Fig. 2 Distribution of the analysed companies by ESG score. To allow the comparability of the scores, the MSCI ESG score has been rescaled in the 0–100 range. Source: own elaborations based on MSCI ESG Research, Refinitiv (LSEG business), S&P Global and Bloomberg data

Concerning the concordance between the ESG scores, it can be noticed from Tables 1, 2, 3 and 4 that the correlation between the Refinitiv and the S&P ESG scores is relatively high according to the Pearson and Spearman measures, but decreases to nearly 50% when moving to the Kendall’s tau and Somers’ D concordance measures. The correlation between the MSCI ESG scores and the other two indicators is instead low, never reaching 40%. This increases the interest in reaching a sustainability metric that combines alternative ESG scores based on their capability to order the observed companies by their creditworthiness.

Table 1 Pearson correlation between the ESG scores
Table 2 Spearman correlation between the ESG scores
Table 3 Kendall’s tau correlation between the ESG scores
Table 4 Somers D correlation between the ESG scores
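
The pair-based concordance measures of Tables 3 and 4 can be illustrated with a short Python sketch that counts concordant and discordant pairs directly. The data below are toy values, not taken from the sample, and the quadratic-time loop is chosen for readability rather than efficiency. Kendall's tau is computed in its tau-b variant and Somers' D in its asymmetric form D(Y|X); which variants the tables use is not stated in the text, so this choice is an assumption.

```python
from math import sqrt


def concordance(x, y):
    """Return (Kendall's tau-b, Somers' D of y given x) from pairwise comparisons."""
    conc = disc = ties_x_only = ties_y_only = 0
    n = len(x)
    for a in range(n):
        for b in range(a + 1, n):
            dx, dy = x[a] - x[b], y[a] - y[b]
            if dx * dy > 0:
                conc += 1          # pair ordered the same way by x and y
            elif dx * dy < 0:
                disc += 1          # pair ordered in opposite ways
            elif dx == 0 and dy != 0:
                ties_x_only += 1   # tied on x only
            elif dy == 0 and dx != 0:
                ties_y_only += 1   # tied on y only
    tau_b = (conc - disc) / sqrt(
        (conc + disc + ties_x_only) * (conc + disc + ties_y_only)
    )
    # Somers' D(Y|X): the denominator counts the pairs that are not tied on x.
    somers_d = (conc - disc) / (conc + disc + ties_y_only)
    return tau_b, somers_d


# Toy ESG scores from two hypothetical providers for four companies.
tau_b, somers_d = concordance([10, 20, 30, 40], [1, 3, 2, 4])
print(round(tau_b, 4), round(somers_d, 4))
```

With no ties, as in the toy data, the two measures coincide: five concordant pairs and one discordant pair out of six give a value of 4/6.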

4.2 Results

4.2.1 In-sample analysis

The first step in our empirical analysis consists of the calculation of the posterior probability-based weights according to the methodology described in Sect. 3. Having no a priori reasons to assign different weights to the scores, we set the M constant in (3) equal to 3, which means that the three scores are a priori equally weighted. The \(\alpha \) parameter is set equal to the ratio between the number of speculative grade companies and the total number of companies in the sample, so that \(\beta =1-\alpha \) is the proportion of investment grade companies in the sample and the prior mean \(E(\theta _j)=\alpha /(\alpha +\beta )\) equals the sample proportion of speculative grade companies.

The posterior weights associated with the scores are estimated on a random training sample of 829 companies (60% of the available observations) and are shown in the second column of Table 5. The third column of Table 5 reports instead the weights obtained by applying the same methodology to the residuals of stepwise linear regression models where the dependent variable is a given ESG score (MSCI, Refinitiv or S&P) and the regressors are the company’s balance sheet variables and market capitalisation. This makes it possible to consider the extent to which the financial information, to which both the ESG scores and the credit ratings are supposed to be related, influences the capability of ESG scores to predict the credit ratings. Coefficient estimates for the estimated linear regression models are shown in Tables 7, 8 and 9.
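
A random 60%/40% split of the 1382 companies, yielding the 829 training observations mentioned above, can be obtained as in the following minimal Python sketch. The function name and the seed are hypothetical, and truncating the training size to an integer is an assumption about how the 60% share was rounded.

```python
import random


def train_validation_split(n_companies, train_share=0.6, seed=0):
    """Randomly split company indices into training and validation sets."""
    indices = list(range(n_companies))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_train = int(train_share * n_companies)    # truncation: 0.6 * 1382 -> 829
    return indices[:n_train], indices[n_train:]


train_idx, valid_idx = train_validation_split(1382)
print(len(train_idx), len(valid_idx))
```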

Table 5 Weights derived from the posterior probabilities associated to the ESG scores, before and after controlling for financial ratios

Table 5 shows that the model weights are somewhat different before controlling for financial ratios, but also that such difference nearly disappears once the same ratios are controlled for. This may be the effect of the different attention given by the providers to the financial ratios. Once they are taken into account, however, the ESG scores have a similar importance in determining creditworthiness. This shows that our proposed model is able to improve fairness, reducing inconsistencies among the data providers and, by taking an equally weighted average of the ESG scores, it does not generate any bias deriving from using one rather than the other.

We also remark that the weights in Table 5 are the main output of our proposed model: a set of weights which is easy to interpret and implement in the monitoring of credit risk.

In other words, with the in-sample analysis we have shown that our proposed model is fair and explainable.

4.2.2 Out-of-sample analysis and robustness

We now provide a predictive analysis where the probability that a company belongs to a certain rating class—conditional on the ESG score—is estimated based on the methodology described in Sect. 3.

Specifically, we use the weights associated with the ESG scores estimated on the training sample (see Sect. 4.2.1) to predict the credit rating in the validation sample (40% of the available observations). According to the proposed merged scoring methodology, the weights are then used to determine, for each company and for each provider domain (Refinitiv, Standard and Poor’s, MSCI), the probability associated with each of the two considered rating categories: Investment Grade or Speculative (High Yield or Distressed) class.

Figure 3 shows the posterior probabilities associated with the different classes of the ESG score distribution, for each of the three scores considered. These probabilities are used to determine the probabilities assigned by the merged score. Indeed, for each company, the probability of belonging to a speculative rating class is calculated as the weighted mean of the probabilities assigned by the three scores, using the Bayesian likelihood-based weights.

Fig. 3 Estimated probability of belonging to a High Yield or Distressed credit rating class by ESG score class, before (left) and after (right) controlling for financial and dimensional indicators. The dashed line indicates the ratio of speculative-grade rated companies in the sample. Source: own elaborations based on MSCI ESG Research, Refinitiv (LSEG business), S&P Global and Bloomberg data

Figure 4 shows the ROC curves of the credit rating predictions based on the ESG scores, obtained by applying the Bayesian model.

Fig. 4 ROC curve of credit rating predictions based on ESG scores, before (left) and after (right) controlling for financial and dimensional indicators, obtained through application of the proposed Bayesian model. Source: own elaborations based on MSCI ESG Research, Refinitiv (LSEG business), S&P Global and Bloomberg data

Note from Fig. 4 that there is no absolute dominance of one specific ROC curve. The relationship depends strictly on the quantiles of reference. In more detail, if we compare the related AUROC measures, the two leading models are the Merged score and the MSCI ones. The Merged score model is, furthermore, more robust (more sustainable in the statistical sense), as it does better in modelling the tails of the distribution, where the more extreme financial profiles lie: companies that are either very bad or very good.

The results are confirmed after controlling for the financial ratios. We can conclude from Fig. 4 that the merged model leads to predictions that are better than those of the single ESG scores on the tails and, in particular, for high cut-off levels. This means that the merged model is resilient to extreme values (upper tail): its performance does not decrease when extreme values are considered, as that of the individual ESG models does. The proposed model is thus a sustainable credit rating model, as it shows that ESG factors are important to predict credit ratings, even when financial variables are inserted into the model. The proposed model also improves predictive accuracy with respect to what the separate ESG scores would achieve.

A question that may arise, especially for the sake of comparison, is whether a different (non-Bayesian) machine learning model would improve predictive accuracy, despite not being explainable. If so, computationally expensive explainable AI methods, such as Shapley values [8, 10], could be applied as an “add-on” to the model.

To this end, we additionally fit a competing model, which is a typical expression of machine learning approaches.

As already introduced, we fit an XGBOOST model by means of the ’xgboost’ package of the R software, setting three tuning parameters as follows: a parameter d, which determines the depth of each boosted tree; a learning parameter \(\eta \), which determines the updating rate; and a parameter B, which determines the number of boosted trees. We select the values of these parameters after a fine-tuning exercise and specifically we take: \(d=1\) or 2; \(\eta =0.001\); \(B=5000\). Indeed, the parameter d controls the complexity/size of the trees in terms of considered variables and depth levels. Given the limited number of variables, we consider very small values for d: 1 and 2. The features employed are the three ESG scores, exactly as for the Bayesian model. We estimate XGBOOST on a 60% training set and evaluate it on the remaining 40% test set (the same strategy employed for the Bayesian model).
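
For readers working in Python rather than R, the tuning parameters above could be mapped onto the Python interface of the same library roughly as follows. This is a sketch under assumptions, not the configuration used in the paper: the choice of the Python interface is a substitution, the `binary:logistic` objective and the `auc` evaluation metric are not stated in the text, and the function names are hypothetical. The library is imported inside the fitting function so that the parameter mapping can be inspected without it installed.

```python
NUM_BOOST_ROUND = 5000  # B: number of boosted trees


def xgboost_params(depth: int) -> dict:
    """Map the tuning parameters d and eta onto xgboost parameter names."""
    return {
        "objective": "binary:logistic",  # assumption: binary speculative/investment target
        "eval_metric": "auc",            # assumption: not stated in the text
        "max_depth": depth,              # d: depth of each boosted tree (1 or 2)
        "eta": 0.001,                    # eta: learning (updating) rate
    }


def fit_xgboost(X_train, y_train, depth: int):
    """Fit a boosted model on the three ESG scores (requires the xgboost package)."""
    import xgboost as xgb  # lazy import: only needed when a model is actually fitted

    dtrain = xgb.DMatrix(X_train, label=y_train)
    return xgb.train(xgboost_params(depth), dtrain, num_boost_round=NUM_BOOST_ROUND)


print(xgboost_params(2))
```

The very small learning rate combined with a large number of shallow trees corresponds to a slow, strongly regularised fit, which is a common choice when only a handful of features is available.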

We ended up with a boosting model whose predictive performance is reported in Fig. 5. For the sake of completeness, we have also included two further classification models, namely Bagging and Random forest, which are ensemble methods still based on classification trees.

In Table 6, we report a full comparison of the different approaches.

Fig. 5

ROC curve of credit rating predictions based on ESG scores obtained through the XGBOOST algorithm. Source: own elaborations based on MSCI ESG Research, Refinitiv (LSEG business), S&P Global and Bloomberg data

Table 6 reports the values of AUROC (area under the ROC curve) and AUPRC (area under the precision-recall curve) for the individual ESG scores, the merged ESG score, XGBOOST, bagging and random forest. AUROC summarises accuracy over every possible threshold; AUPRC similarly considers the area under the precision-recall curve, regardless of the threshold. This strategy allows us to produce a robust quality assessment of the competing models without imposing any subjective assumption. From Table 6, we infer that both AUROC and AUPRC are very close when comparing the Bayesian model and XGBOOST. Indeed, the slight improvement in accuracy of XGBOOST is limited and not statistically significant.
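For readers less familiar with these two threshold-free measures, the sketch below computes them in plain Python. It assumes binary labels (1 for the class of interest) and no tied scores; AUPRC is computed as average precision, which is one common estimator and not necessarily the one used for Table 6. The function names and the five-observation example are illustrative only.

```python
def auroc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(y_true, scores):
    """Average precision: step-wise area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    true_pos = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            # Precision at this rank, weighted by the recall increment 1 / n_pos.
            area += (true_pos / rank) / n_pos
    return area

# Illustrative labels and predicted probabilities (not taken from the paper).
labels = [1, 0, 1, 0, 1]
probs = [0.9, 0.8, 0.7, 0.3, 0.6]
print(round(auroc(labels, probs), 4), round(auprc(labels, probs), 4))
```

Because both measures integrate over all cut-offs, no single classification threshold has to be chosen before comparing models.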

Table 6 AUROC and AUPRC of credit rating predictions based on ESG scores, obtained through application of the proposed Bayesian model and the XGBOOST algorithm

Indeed, the proposed Bayesian model does not offer an exceptional performance, especially because the effect of ESG factors on credit ratings is probably limited, but it has a clear and undeniable advantage: it is explainable by design and it offers a system of weights that can be used in further analysis. On the other hand, the XGBOOST model, which is not explainable by design, does not lead to a gain in predictive accuracy that can justify the use of a computationally expensive explainable AI method, such as Shapley values. The same applies to Bagging and Random Forest, which perform even worse than XGBOOST.

We remark that both XGBOOST and bagging/random forest can be made explainable (in a qualitative sense) using a variable importance plot. Unfortunately, the variable importance plot is not fully agnostic: we cannot use it for the Bayesian model, for example, and thus cannot make comparisons. In this regard, we report in Fig. 6 the variable importance plots obtained from the XGBOOST algorithm. Two measures are used to rank the variables: mean decrease accuracy and mean decrease Gini. Both agree on the ranking of variable importance: first ESG from Refinitiv, second ESG from S&P, third ESG from MSCI. The results confirm what was obtained from the Bayesian model, namely the relevance of the ESG scores produced by Refinitiv. As the second most important variable, the variable importance plot selects the ESG scores from S&P conditionally on Refinitiv, unlike the Bayesian model, which proceeds with a simultaneous selection. We remark that the weights and the importance attributed to the different ESG scores have a merely descriptive purpose within the framework of our model and do not imply any evaluation of their intrinsic quality.

Fig. 6

Variable Importance Plot obtained from the XGBOOST algorithm. Source: own elaborations

Although a computationally expensive explainable AI method may not be justified in our context, we have tried to interpret the predictions obtained from XGBOOST with a graphical method. Fig. 7 plots the estimated probabilities by the three ESG scores, similarly to what was done in Fig. 3 for the Bayesian model.

Fig. 7

Estimated probability of belonging to a high yield or distressed credit rating class by ESG score class, with the XGBOOST model. Source: own elaborations based on MSCI ESG Research, Refinitiv (LSEG business), S&P Global and Bloomberg data

Comparing Fig. 7 with Fig. 3 (left), note that the behaviours of the estimated probabilities are rather similar. In both cases, there is an overall negative dependence between the ESG score class and the probability of default; moreover, the range of variation of Refinitiv is the smallest. This implies a higher weight in the Bayesian model for Refinitiv than for more discriminant scores, such as S&P. Figure 7 shows that the probabilities estimated by XGBOOST generally have a lower variability than those from the Bayesian model. This is in line with the smoothing effect of the (nonlinear) XGBOOST model.

5 Conclusions

In this paper, we have shown how creditworthiness can be measured by means of a S.A.F.E. machine learning model that reduces model risks, in line with the emerging regulations on artificial intelligence, which aim to measure the risks of artificial intelligence in order to promote its usage.

The model is Sustainable, as credit ratings can take Environmental, Social and Governance factors into account. The model is Accurate, as it shows that ESG scores have an effect, although limited, on the prediction of credit ratings. The model is Fair, as it can level out differences between ESG data providers by taking an averaged score. The model is Explainable, as it can be easily interpreted by means of a set of normalised weights assigned to the different ESG providers, which are a function of their relative predictive accuracy.
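The idea of normalised weights and an averaged score can be illustrated with a short sketch. It assumes, purely for illustration, that each weight is proportional to a non-negative accuracy measure of the corresponding provider; this is not necessarily how the weights of the Bayesian model are derived. The provider names and all numbers are hypothetical placeholders, not results from this paper.

```python
def normalised_weights(accuracy):
    """Rescale non-negative accuracy measures so that the weights sum to one."""
    total = sum(accuracy.values())
    return {name: value / total for name, value in accuracy.items()}

def merged_score(scores, weights):
    """Weighted average of the providers' scores for one company."""
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical accuracy measures and scores (placeholders only).
accuracy = {"provider_A": 0.6, "provider_B": 0.3, "provider_C": 0.1}
weights = normalised_weights(accuracy)
company_scores = {"provider_A": 80.0, "provider_B": 50.0, "provider_C": 20.0}
combined = merged_score(company_scores, weights)
```

Since the weights sum to one, the combined score stays on the same scale as the individual scores, and each weight can be read directly as the relative contribution of a provider.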

We believe that this paper is the first of its kind, and that it may generate debate and impact in both the AI and the financial communities.

This is so, in particular, because the proposed model can improve ESG standardisation, providing a solution to the problem of multiple ESG ratings. The increased attention to sustainability issues has led to a proliferation of rating agencies and ESG scores, with multiple ESG scores on the market that are often divergent and provide different types of information. In the paper we show how to merge different ESG scores into a single ESG score that combines the information given by the different providers. A combined score can bring better evidence on whether ESG factors can be predictors of credit rating classes.

Our findings have many implications for the application of statistical learning and artificial intelligence methods in the financial sector. What we have presented can be useful for investors in financial markets, who can exploit the information provided by different ESG scores in a comprehensive setting, reducing information asymmetries on ESG company performance. It can also be useful for lenders in credit markets, as they can make better informed use of ESG factors in determining creditworthiness, to the benefit of the best-performing companies in terms of sustainable behaviour. It can also be of interest for insurance companies, helping them to price climate and ESG related events.

Our research is also of interest for regulators and supervisors in the financial sector, as it provides a standardised metric to measure the impact of different ESG scores, along with a combined score, thereby improving the assessment of the sustainability of the companies that receive ESG ratings. Finally, it is important for ESG data providers, as they can receive feedback on the relative quality of their metrics and, possibly, improve them.

We also remark that the scope of this paper is to provide indications to financial institutions on the relative quality of different ESG providers (in terms of their predictive accuracy).

Future research could examine whether our results, obtained on the available data cleaned of missing values, extend to a different and possibly larger database.

Future research should also extend our work to cover companies for which ESG scores are missing for some providers. Our approach can be easily generalised to this context by assigning companies with missing scores to a distinct new category that contains all companies with missing information.
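A minimal sketch of this generalisation is given below. It assumes scores on a 0 to 100 scale and three hypothetical class cut-offs (25, 50, 75); the actual class boundaries, the function name and the provider labels are assumptions made for illustration. A missing score (None or NaN) is mapped to its own category instead of being dropped.

```python
import math

def esg_class(score, cutoffs=(25.0, 50.0, 75.0)):
    """Map a score to an ordinal class label; a missing score gets its own class."""
    if score is None or (isinstance(score, float) and math.isnan(score)):
        return "missing"
    # Class index = 1 + number of cut-offs strictly exceeded by the score.
    return "class_" + str(1 + sum(score > c for c in cutoffs))

# Hypothetical company with no score from one provider.
company = {"provider_A": 82.0, "provider_B": None, "provider_C": 41.5}
classes = {name: esg_class(score) for name, score in company.items()}
```

With this encoding, a company lacking a score from one provider can still enter the model, with "missing" treated as one further level of the corresponding categorical variable.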

Future research should also address the application of the proposed methodology to other regulated industries, such as the health care and automotive sectors, and, possibly, to other high-risk artificial intelligence applications.