1 Introduction

According to the Banca Ifis 2021/2022 report, “even though in 2020 Italy still has a Non-Performing Exposures (NPE) ratio above the EU average, we expect European NPE stock to increase by 60 billion euros in 2022–2023, worse than that estimated for the Italian financial system”. From 2017 to 2020, over 50 billion euros were invested in the Non-Performing Loans (NPLs) market to buy approximately 214 billion euros of NPLs portfolios.

Nevertheless, the intense transactional activity on the secondary market - with an incidence of 32\(\%\) in 2021 (IFIS Banca (2022)) - determined the dynamism of the NPLs market. For instance, according to the annual report by the National Observatory NPE Market of Credit Village, Italy (Osservatorio Nazionale NPE Market (2020)), the Italian secondary market recorded a significant boom in 2020: 284 deals for 12 billion euros of Gross Book Value (GBV). The main highlight of the survey is the market's advanced degree of maturity after the chaotic 2017–2019 triennium, to which the contribution of foreign investors in the divestiture of whole portfolios was particularly relevant.

The market shows deals divided between secured and unsecured loans, with a high incidence of corporate customers and a large unsecured component. Overall, the market shows a sort of normalization, with a greater balance between the types of portfolios transacted. By contrast, the first years of growth of the NPLs transaction market were characterised by a high concentration of unsecured credit portfolios (Fig. 1).

Fig. 1 Source: Ifis Bank NPLs Market Database - News and press releases - Banca Ifis internal analysis 2021

The transfer of NPLs to the secondary market can represent a de-risking strategy for the banks operating in the primary market, since it relaxes the burden of NPLs management. Indeed, it can help banks offload NPLs from their balance sheets, distribute the risk, free up bank resources, and strengthen bank stability. In other words, transactions on the secondary market appear to be an attractive de-risking strategy, owing to the high flexibility characterising the securitization structure. In particular, the new legal framework, Directive (EU) 2021/2167 on credit servicers and credit purchasers (also known as the NPLs Directive), could promote an increase in business for specialised credit servicers. The Directive took effect on 28 December 2021, with a deadline for implementation in all member states of 29 December 2023. Its rules aim to expedite and regulate the development of a secondary NPLs market in Europe through disclosure requirements towards purchasers, notification requirements to borrowers, and reporting rules to regulators. More competition and lower transaction costs are expected in the NPLs marketplace.

The NPLs transactions on the secondary market by portfolio type confirm the trend of divestment of unsecured shares, which require specialised services. In light of these considerations, in this paper we focus on secured NPLs transactions. The profitability of the secured segment is influenced by the liquidation value of the properties underlying the defaulted mortgages. The idea behind our research is to accurately estimate the profitability of transactions through fair due diligence. We propose to improve the due diligence process by developing an artificial intelligence algorithm: a machine learning due diligence framework that accurately evaluates the expected recovery rates of defaulted mortgage loans.

One of the most important factors governing the price of NPLs portfolios is the recovery rate, that is, the percentage of exposure that can be recovered from each borrower through the debt collection process. To the best of our knowledge, there has been no contribution on explainable machine-learning decisions for NPLs recovery rate models. The only research on the topic focused on an empirical comparison of different classes of algorithms, i.e. linear, nonlinear, and rule-based, to identify the model structure best suited to the recovery rate problem (Bellotti et al. 2021). Most importantly, however, the aim of those authors was to design “a new set of behavioural predictors based on data regarding the recovery procedure promoted by the bank selling the NPLs”. The case of retail loan recovery rates is examined through appropriate predictive models based on macroeconomic factors, as in Nazemi and Fabozzi (2018); Nazemi et al. (2018).

Other interesting research investigates how to apply linear regression, beta regression, inflated beta regression, and a beta mixture model combined with logistic regression to model the recovery rate of non-performing loans (Ye and Bellotti 2019).

Unlike the balance-sheet evaluations of NPLs on the primary market, pricing on the secondary market faces the challenges of market frictions and the asymmetric information that arises because buyers know less about asset quality than sellers. “Buyers would therefore fear that assets they are bidding for are of low quality, and bid at a correspondingly low price. The sellers, being able to distinguish between low and high-quality assets, trade only in the former type - the lemons - whereas the market for the remaining assets fails. Additionally, it may be the case that sellers of NPLs may not have perfect information concerning their own assets. The resultant problems associated with informational asymmetry remain, however, as buyers cannot know whether sellers are revealing all available information.” (European Central Bank (2016)). In this context, we provide a tailor-made pricing approach for the secondary market which prices the risky component of asymmetric information.

The layout of the paper is the following. Section 2 provides practical considerations on the new legal framework, the so-called European Union directive on non-performing loans. Section 3 proposes an ad hoc Dependent Random Forest Regressor algorithm, the Dependent Forest based on Non-Linear Canonical Correlation (DF-NLCC), built on a specialised splitting rule. Section 4 proposes a pricing approach for NPLs portfolios on the secondary market. Section 5 illustrates the main outcomes of the empirical applications. Section 6 concludes.

2 New legal framework: the European Union directive on non-performing loans

The NPLs Directive (EU Directive 2021/2167) regulates the activities of the key parties in the NPLs secondary market, in particular credit servicers, credit purchasers and borrowers. It introduces a new set of rules to increase the attractiveness and competitiveness of the secondary market, driving down credit servicing costs and thus lowering the cost of entry for potential credit purchasers, potentially leading to increased demand and higher NPLs sale prices.

The NPLs Directive properly acknowledges the existence of a European secondary market for loans and creates a devoted framework through the authorisation and supervision of the credit servicers.

New requirements are imposed on credit servicers, in particular when interacting with borrowers.

On the one hand, they have to put in place specific agreements with credit purchasers when outsourcing their credit servicing activities to credit service providers. On the other hand, selling banks will be required to provide all necessary information to credit purchasers, enabling them to assess the value of the loan assets offered for sale and the likelihood of recovery. In general, the main objective of the regulation is to increase disclosure requirements towards purchasers, notification requirements to borrowers, and reporting to regulators.

Member states have until 29 December 2023 to implement the Directive. The implementation phase is a transition period, which leaves time for market players to assess the implications and the opportunities it offers. For instance, the different definitions in the Directive of credit purchasers, credit servicers, and credit servicing activities may seem confusing at first glance, introducing some uncertainty also in the categorisation of NPLs (and in how sales of combined portfolios comprising NPLs and performing loans are to be dealt with). Nevertheless, the new regulation attests to the interest in, and the need for, harmonising the regulation of Europe’s secondary NPL markets while protecting borrowers’ rights.

3 Dependent forest based on non-linear canonical correlation (DF-NLCC): a specialised splitting rule

In this section, we propose a Dependent Forest algorithm based on Non-Linear Canonical Correlation (DF-NLCC), defined by a specialised splitting rule for projecting the recovery rate of a portfolio of secured NPLs. We develop this technique in order to capture the variables linked to an NPLs portfolio characterised by a complex dependency structure. In particular, considering the case of a secured NPLs portfolio, the recovery rate is a time-dependent variable, since the shorter the recovery time of the credit the higher the recovery rate. In this sense, the recovery rate can be considered a random variable whose parameters are time-varying. Similarly, the covariates that determine a secured NPLs portfolio have relationships with each other and they depend on recovery time. In fact, the time required to recover credit depends on the Region, the legal entity type and the economic sector. We consider these variables as determinants of the recovery rate of an NPLs portfolio. Moreover, since we take into account secured NPLs that depend on real estate collateral, the book value of the credit is dependent both on the value of the underlying real estate and on the waiting time of the collective mandatory action.

Despite their widespread use, publicly available data on NPLs portfolios are very limited, mainly because of the confidentiality of the data for credit holders and for stakeholders interested in buying a portfolio. To cope with the lack of data, we propose a simulation analysis that defines the parameters of a series of random variables. We set up secured NPLs portfolios whose recovery is achieved through collective mandatory actions.

The duration of the legal action is simulated using a two-parameter Gamma random variable, whose shape parameter k and scale parameter \(\theta\) are estimated using an iterative procedure that minimizes the differences between the first-quantile-to-mean and third-quantile-to-mean ratios and the corresponding ratios of the general distribution of the duration of the bankruptcy. The choice of the Gamma distribution lies in its interpretation as the waiting time until an event, i.e. the recovery of the credit through a collective mandatory action. Starting from the Gamma random variable defined above, we estimate a series of distributions of the waiting time for the collective mandatory action, distinguishing among geographical areas, legal forms and business sectors. In particular, the scale parameter \(\theta\) is assumed constant for each variable, since the differences among the expected values across categories are negligible, and the parameter k is estimated using the same iterative procedure as for the general distribution of the duration of the collective mandatory action.

The Recovery Rate (RR) is estimated by taking into account the relationship between credit recovery and time, so that a series of Beta distributions are estimated, with parameters determined by an iterative procedure for different time classes. The Beta distribution is suitable for estimating the RR since it is generally used for modelling proportions of a given phenomenon. The Book Value (BV) is estimated taking into account its positive correlation with the RR: since the BV is a continuous variable, its distributions are modelled as Gaussian, with mean and variance estimated by an iterative procedure on the data of the RR distribution. Finally, the difference between the real estate value and the BV is estimated taking into account the relationship between the two variables: considering the real estate value as a percentage of the BV, the distributions are again modelled as Gaussian, with mean and variance estimated by an iterative procedure on the data of the BV distribution.
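To make the simulation design concrete, the following Python sketch reproduces its structure; every numeric parameter (the common scale \(\theta\), the category-specific shapes k, the Beta and Gaussian parameters, and the BV–RR link) is an illustrative placeholder rather than a value produced by the iterative calibration described above.

```python
# Illustrative sketch of the simulation set-up; all numeric parameters below
# are placeholders, not the values produced by the iterative calibration.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # simulated NPLs

# Duration of the collective mandatory action: Gamma(k, theta), with theta
# held constant across categories and k varying by category (here: Region;
# legal form and business sector are handled analogously).
theta = 1.5                                                # assumed common scale
k_by_region = {"North": 2.0, "Centre": 2.6, "South": 3.0}  # hypothetical shapes
region = rng.choice(list(k_by_region), size=n)
duration = rng.gamma(shape=[k_by_region[r] for r in region], scale=theta)

# Recovery rate: Beta distributions per duration class, with longer waits
# shifting mass towards lower recoveries (hypothetical parameters).
alpha = np.where(duration < 3, 4.0, 2.0)
beta = np.where(duration < 3, 2.0, 4.0)
rr = rng.beta(alpha, beta)

# Book value: Gaussian, positively correlated with RR (assumed link), floored.
bv = np.maximum(rng.normal(loc=100_000 + 50_000 * rr, scale=20_000), 1_000)

# Real estate value as a Gaussian percentage spread over the BV.
rev = bv * (1 + rng.normal(loc=0.05, scale=0.10, size=n))
```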

The random forest methodology can be described as an ensemble of decision trees. The concept underlying a random forest consists of averaging multiple decision trees, each affected by high variance, in order to develop a better generalization that is “less susceptible to overfitting” (Raschka and Mirjalili 2017). Several approaches for creating an ensemble combine multiple classifiers into a meta-classifier that obtains better performance than each individual classifier.

The random forest can be defined as a general principle of classifier combination that uses L tree-structured base classifiers \(\{h(x,\Xi _k), k=1,\ldots ,L\}\), where \(\Xi _k\) represents a family of independent identically distributed random vectors and x is an input datum. This implies that there is no guarantee that all those trees will cooperate effectively in the same committee (Bernard et al. 2009).

Nevertheless, the underlying data-generating process presents dependence properties, which asymptotically influence the distribution of the statistics of interest. This issue justifies the development of the algorithmic framework we propose, which mimics the dependence properties (or even the process itself) in order to avoid the dependency risk codified in D’Amato et al. (2013). In the random forest context, the dependency risk consists of misleading evaluations in classification or regression. In our proposal, the dependent forest consists of many unsupervised decision trees with a specialised splitting criterion. In particular, the specialised splitting criterion relies on the nonlinear relationships of a variable with a set of variables.

We build the individual trees in the forest with a splitting rule specifically designed to partition the data to maximize the non-linear canonical correlation heterogeneity between nodes arranged as in OVERALS (van der Burg et al. 1994). The canonical correlations represent how much variance of the dependent variables is explained by the dimensions, where the canonical dimensions are latent variables that are analogous to factors obtained in factor analysis, except that canonical variates also maximize the correlation between the two sets of variables. In general, not all the canonical dimensions will be statistically significant. A significant dimension corresponds to a significant canonical correlation and vice versa.

We design the architecture of an unsupervised random forest based on the set of covariates Z to find subgroups of observations with non-linear canonical correlations between mean-centered multivariate datasets X and Y.

The tree-growing process is based on the Classification and Regression Tree (CART) approach (Breiman et al. 1984). The basic idea of tree growing with CART is to select, at each parent node, the best split among all possible splits to obtain the child nodes. Inspired by Alakuş et al. (2021), where the conditional canonical correlations between two sets of variables given subject-related covariates have been estimated, we propose a splitting rule that increases the non-linear canonical correlation heterogeneity as fast as possible, the goal being to find subgroups of subjects with distinct non-linear canonical correlations. Unlike Alakuş et al. (2021), we set up the random forest architecture in a non-linear environment, with the focus on the non-linear canonical correlation heterogeneity. The proposed splitting criterion is expressed by the following:

$$\begin{aligned} \sqrt{n_L \cdot n_R} \cdot |nr_L - nr_R| \end{aligned}$$
(1)

where:

  • \(n_L\) is the size of the left node;

  • \(n_R\) is the size of the right node;

  • \(nr_L\) is the non-linear canonical correlation estimation of the left node;

  • \(nr_R\) is the non-linear canonical correlation estimation of the right node.

To define the non-linear canonical correlation, we consider a measure of similarity with respect to an undefined variable x that identifies the concept described by a dataset H. The idea is to find the weights a that make the weighted sum of the variables as similar to x as possible. In this sense, the non-linear correlation is the Sum of Squares (SSQ) of the difference between the undefined variable x and the weighted sum aH of the dataset variables:

$$\begin{aligned} nr = SSQ(x - aH) \end{aligned}$$
(2)

The best split among all possible splits is the one that maximizes formula (1).
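A minimal sketch of how the splitting score could be evaluated for a candidate split is given below; here a plain least-squares fit stands in for the OVERALS optimal-scaling step, and `x`, `H` and the boolean `mask` are assumed inputs.

```python
import numpy as np
from numpy.linalg import lstsq

def nl_score(x, H):
    """nr = SSQ(x - aH), Eq. (2): residual sum of squares of the best
    weighted sum of the columns of H approximating x."""
    a, *_ = lstsq(H, x, rcond=None)   # least-squares weights a
    resid = x - H @ a
    return float(resid @ resid)

def split_value(x, H, mask):
    """Eq. (1): sqrt(n_L * n_R) * |nr_L - nr_R| for a candidate split,
    where `mask` flags the observations sent to the left node."""
    n_l, n_r = int(mask.sum()), int((~mask).sum())
    if n_l == 0 or n_r == 0:
        return -np.inf
    return np.sqrt(n_l * n_r) * abs(nl_score(x[mask], H[mask]) -
                                    nl_score(x[~mask], H[~mask]))
```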

4 In-depth pricing approach on the secondary market

In order to evaluate the price of an NPLs portfolio sold on the secondary market, the DF-NLCC allows us to determine the projected RR. Nevertheless, to compute the value of NPLs on the secondary market and the corresponding profitability for a potential buyer, once the RR of an NPLs portfolio has been estimated, we focus on the relevance of structural inefficiencies and information asymmetries that drive a wedge between book values and market values of NPLs (European Central Bank (2016)). Indeed, a bank selling an NPLs portfolio tends to sell the worst loans it has; it is therefore reasonable that the price proposed by the buyer should take this aspect into account. In light of these considerations, in this section we propose a pricing approach with adjustments for the secondary market, considering separately both the cost of capital of booking the loans in the buyer’s balance sheet and the weight, in terms of information asymmetry, between the bank and the potential buyer. Basically, the evaluation framework under consideration builds the incidence of the risky component of information asymmetry on top of the classical balance-sheet estimates.

Therefore, we distinguish between the first investor \(I_1\), the potential investor that ignores the information asymmetry at the time of the transaction and assesses the NPLs portfolio on the basis of the cost of capital only, and the second investor \(I_2\), the potential investor that considers both the cost of capital and the information asymmetry.

The value of the NPLs portfolio (NPV) in the secondary market is defined as the present value of the expected cash flows (CF) at a risk-free rate minus the Cost of Capital (CoC), where the CoC is defined through the 99.5\(\%\) Value-at-Risk (VaR) of the predicted RR, discounted at the cost of capital rate.

$$\begin{aligned} NPV = \mathbb {E}[CF, i_{RF}(0,t)] - CoC \end{aligned}$$
(3)

where \(\mathbb {E}[CF, i_{RF}(0,t)]\), discounted at time \(t = 0\), is the expected cash flow net of the expected recovery rate, with a spot risk-free rate structure.

$$\begin{aligned} CoC = \left( \mathbb {E}[CF, i_{RF}(0,t), VaR_{99.5\%}(RR)] - \mathbb {E}[CF, i_{RF}(0,t)] \right) \cdot i_{CoC} \end{aligned}$$
(4)

where:

\(i_{CoC}\) is the cost of capital rate of the investor in NPLs in the secondary market;

\(\mathbb {E}[CF, i_{RF}(0,t), VaR_{99.5\%}(RR)]\), discounted at time \(t=0\), is the cash flow net of the recovery rate at the 99.5\(\%\) probability level.

To take the CoC into account, we quantify the spot rate \(i_T\) that reproduces the net present value; the information asymmetry (IA) is then added on top:

$$\begin{aligned} i_{T + IA} = i_{RF + CoC + IA} \end{aligned}$$
(5)

\(i_T\) is estimated by the following equation:

$$\begin{aligned} \mathbb {E}[CF, i_{T}(0,t)] = NPV \end{aligned}$$
(6)

Then, the market value of the NPLs portfolio negotiated in the secondary market will be:

$$\begin{aligned} \mathbb {E}[CF, i_{T}(0,t)] - C_{IA} \end{aligned}$$
(7)

where \(C_{IA}\) is the cost of information asymmetry.

So, \(i_{T + IA}\) is the solution of the following equation:

$$\begin{aligned} \mathbb {E}[CF, i_{T+IA}(0,t)] = \mathbb {E}[CF, i_{T}(0,t)]- C_{IA} \end{aligned}$$
(8)
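The pricing chain of Eqs. (3)–(8) can be sketched numerically as follows; the cash flows, rates and cost of information asymmetry are hypothetical placeholders, and Eq. (4) is read as the positive capital absorption (expected minus stressed present value) times \(i_{CoC}\).

```python
import numpy as np

def pv(cf, rate, t):
    """Present value at time 0 of cash flows cf occurring at times t (years)."""
    return float(np.sum(cf / (1 + rate) ** t))

# Hypothetical inputs: expected cash flows, stressed cash flows at the 99.5%
# VaR of the RR distribution, a flat risk-free rate, a cost-of-capital rate
# and a cost of information asymmetry (all placeholder values).
t = np.arange(1, 11)
cf_expected = np.full(10, 10.0)   # expected recoveries per year
cf_var = np.full(10, 7.0)         # recoveries under the 99.5% VaR of RR
i_rf, i_coc, c_ia = 0.02, 0.10, 3.0

# Eq. (4), read as the positive capital absorption times the CoC rate.
coc = (pv(cf_expected, i_rf, t) - pv(cf_var, i_rf, t)) * i_coc
npv = pv(cf_expected, i_rf, t) - coc                      # Eq. (3)

def implied_rate(cf, t, target):
    """Flat rate i such that pv(cf, i, t) = target, found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if pv(cf, mid, t) > target else (lo, mid)
    return (lo + hi) / 2

i_t = implied_rate(cf_expected, t, npv)                   # Eq. (6)
market_value = pv(cf_expected, i_t, t) - c_ia             # Eq. (7)
i_t_ia = implied_rate(cf_expected, t, market_value)       # Eq. (8)
```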

5 Numerical application

In this section, we implement the DF-NLCC algorithm. Firstly, we generate a simulated dataset of 10,000 NPLs, following the iterative procedure of random variable parameter estimation described in Sect. 3. To simulate a portfolio of NPLs coherent with the characteristics of the Italian market, we refer to a technical report released by Cerved SpA (2020), which contains data about the duration of mandatory actions, divided by Regions, legal forms and business sectors, considering both defaults and voluntary liquidations. Furthermore, we consider data on the recovery rate by recovery waiting time, estimated by Fischetto et al. (2021) on NPLs with collaterals. In this way, it is possible to estimate not only the recovery time, by selecting only the NPLs with collaterals, but also the relationship between the value of the collateral and the recovery rate.

Tables 1, 2, 3, 4 and 5 show the estimated parameters of the distributions:

Table 1 Duration of the mandatory actions distribution parameters by region

As shown by Table 1, the \(\theta\) parameter is assumed to be the same for all the Regions. For this reason, differences in duration are determined by the k parameter only, which governs the skewness of the distribution (for a Gamma distribution the skewness equals \(2/\sqrt{k}\)). In this sense, the lower the k parameter, the higher the positive skewness of the distribution and, since \(\theta\) is constant, the lower the average waiting time \(k\theta\) of the mandatory actions. Furthermore, we can observe that the Northern Regions have lower k parameters than Centre and South Italy; we can therefore assume geographical differences in recovery waiting times, which is consistent with the starting data.

Tables 2 and 3 follow the same logic as Table 1.

Table 2 Duration of the mandatory actions distribution parameters by legal form

As Table 2 shows, the Limited company has a greater positive skewness and a lower average waiting time with respect to the Sole proprietorship and the Partnership, whose distributions are very similar to each other.

Table 3 Duration of the mandatory actions distribution parameters by business sector

As Table 3 shows, the Services class has a greater positive skewness and a lower average waiting time with respect to the other sectors. On the contrary, Agriculture and Manufacturing show the lowest skewness and the greatest average time, while Energy and Utilities and Construction have an average waiting time lower than the former classes (Agriculture and Manufacturing) and greater than Services.

Table 4 RR distribution parameters

In Table 4, the Beta parameters \(\alpha\) and \(\beta\) can be interpreted as the proportion of successes and the proportion of failures, respectively. In this sense, we can observe that the greater the waiting time of the mandatory action, the lower the proportion of credit recovered.

Table 5 BV distribution parameters

Table 5 shows that the greater the value of the credit, the greater the mean and the variance of the recovery distribution. We can also observe that the relationship is less than proportional: the larger the credit size, the higher the recovered value, but the recovery rate is not necessarily greater.

Table 6 Real estate value distribution parameters

As shown in Table 6, the real estate value, expressed as a percentage difference from the BV, has a negative mean for the lower classes and a positive mean for the highest class.

The simulated data are used to estimate a random forest to predict the RR. To do this, we extract a training set equal to 80\(\%\) of the total. Table 7 shows the summary statistics of the RR for both the training and validation sets:

Table 7 Summary statistics of RR

As Table 7 shows, the two datasets are balanced and there are no outliers.

Table 8 describes the features used by the tree-based algorithms.

Table 8 Features of tree-based algorithms

To compare the results, we perform a Decision Tree (DT), a Random Forest (RF) and an XGBoost (XGB). Table 9 shows the in-sample and out-of-sample accuracy for the three models.

Table 9 Models error estimators

As shown in Table 9, accuracy is higher for RF in in-sample forecasting with respect to the other methods, while XGB shows a greater accuracy in out-of-sample forecasting.
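A sketch of this model comparison follows; the simulated portfolio is assumed to sit in a pandas DataFrame `df` holding the Table 8 features plus a target column `RR` (names hypothetical), and all hyperparameters are placeholders.

```python
# Sketch of the DT/RF/XGB comparison on the simulated portfolio.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

X = pd.get_dummies(df.drop(columns="RR"), dtype=float)  # one-hot encode categoricals
y = df["RR"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

models = {
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=500, random_state=0),
    "XGB": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name,
          "in-sample MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 4),
          "out-of-sample MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 4))
```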

Table 10 compares the ranking of variable importance for DT and RF, and Fig. 2 shows the variable importance for RF. As the criterion, we use the percentage increase in Mean Square Error (MSE).

Table 10 Variable importance comparison
Fig. 2 Variable importance for the random forest

As intuitively expected in secured transactions, Table 10 and Fig. 2 show that the real estate value (REV) is the most important feature. Duration (DUR) and book value (BV) are also important variables: between the two, DUR ranks higher for DT and BV for RF. The other variables have the same importance ranking under both methods, so the choice between DT and RF does not substantially affect the variable importance.

Since the XGBoost model processes quantitative variables only, we create dummies for each category or class of the categorical variables. For this reason, the importance analysis shows the relevance of each category or class. Again, the criterion is the percentage increase in Mean Square Error (MSE).

Table 11 Variable importance for XGBoost

As Table 11 and Fig. 3 show, the most important variables are the real estate value (REV), the book value (BV) and the duration (DUR), with an improvement in MSE of 0.563, 0.334 and 0.061, respectively. The improvement in MSE of the other variables is considerably lower.

Fig. 3 Variable importance for XGBoost
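This importance analysis can be sketched with permutation importance, a standard proxy for the percentage-increase-in-MSE criterion; `models`, `X_te` and `y_te` are taken from the comparison sketch above.

```python
# Permutation importance on the hold-out set as a proxy for %IncMSE.
import pandas as pd
from sklearn.inspection import permutation_importance

perm = permutation_importance(models["XGB"], X_te, y_te,
                              scoring="neg_mean_squared_error",
                              n_repeats=10, random_state=0)
importance = pd.Series(perm.importances_mean, index=X_te.columns)
print(importance.sort_values(ascending=False).head(10))
```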

The value of the total NPLs portfolio is obtained by discounting the BV of each credit, multiplied by the RR, at a market interest rate assumed to be 10\(\%\). The duration of the recovery is assumed equal to the one simulated for the construction of the dataset, while for the RR we use both the value in the simulated dataset and the estimates obtained through the three analysed models. Results are shown in Table 12.

Table 12 Portfolio value comparison

Table 12 shows that both the portfolio values and the ratios for the three models are similar, with only a small deviation from the value obtained directly from the simulated dataset.
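A minimal sketch of this valuation, assuming per-credit arrays `bv`, `rr` and `duration` from the simulated dataset (with `rr` replaceable by a model's out-of-sample predictions):

```python
import numpy as np

def portfolio_value(bv, rr, duration, rate=0.10):
    """Sum of each credit's recovery (BV * RR) discounted over its
    recovery duration at the assumed market rate."""
    return float(np.sum(bv * rr / (1 + rate) ** duration))

# e.g. portfolio_value(bv, rr_simulated, duration) versus
#      portfolio_value(bv, models["RF"].predict(X), duration)
```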

Table 13 shows the sensitivity analysis of the interest rate for the portfolio evaluation:

Table 13 Portfolio value according to the discount rate

As shown in Table 13, there are no significant deviations in the portfolio valuation across models as the interest rate changes.

To define the valuation rate, we determine the capital requirement for the risk of a realised RR lower than expected. In particular, the capital absorption associated with the credit recovery risk is calculated on the basis of the probability distribution of the RR, by estimating the VaR of the loss at the 99.5\(\%\) level. The cost of capital is then determined on the basis of the recovery duration, assuming a cost of capital rate of 10\(\%\) per year. The resulting amount defines the discount rate to be adopted for the valuation of the portfolio, in line with the expected remuneration required by an investor whose risk appetite is consistent with the 99.5\(\%\) level adopted for the VaR.
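A schematic Monte Carlo reading of this capital requirement follows; `rr_draws` is an assumed matrix of simulated RR scenarios per credit, and the exact estimation procedure used in the paper may differ.

```python
import numpy as np

def capital_requirement(bv, rr_draws, duration, i_rf=0.02):
    """PV shortfall between the expected RR and its 0.5% quantile
    (the 99.5% VaR on the loss side), per credit, then summed."""
    rr_mean = rr_draws.mean(axis=0)                    # expected RR per credit
    rr_stress = np.quantile(rr_draws, 0.005, axis=0)   # 99.5% VaR on the loss side
    disc = (1 + i_rf) ** duration
    return float(np.sum(bv * (rr_mean - rr_stress) / disc))

# Cost of capital at the assumed 10% per year over the recovery horizon:
# coc = capital_requirement(bv, rr_draws, duration) * 0.10
```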


The value of the total NPLs portfolio is obtained by discounting the NBV of each credit, multiplied by the RR, at a market cost of capital rate assumed to be 10\(\%\). The duration of the recovery is assumed equal to that in the simulated dataset, and we compare the RR from the dataset with the out-of-sample forecasts of the three models.

Table 14 shows the results:

Table 14 Portfolio value

In other words, by discounting cash flows at 13.27\(\%\) yearly, compared with the undiscounted value, we obtain a difference equal to the total cost of capital, which is 309,239,223 euros (about 309 million).

To perform the bridge analysis, we project the cash flows over 35 years. Figure 4 shows the results.

Fig. 4 Cash flow analysis by years of projection

As shown in Fig. 4, the expected cash flow grows in the first years of the transaction, peaks in the second year, and then decreases exponentially.

5.1 Profitability analysis: from the NPV to the market value

Table 15 shows the bridge analysis which highlights the differences between the Gross Book Value (GBV), the Net Book Value (NBV), the market value of the first investor and the market value that could form on the secondary market based on the price defined by the second investor:

Table 15 Bid/ask spread

As shown in Table 15, the difference between the bank’s NBV and the market value defined by the first investor is mainly due to the discount rate used and to the indirect costs charged by the investor, assuming that the cash flows estimated by the two counterparties are equal. In particular, with regard to the valuation rate, the bank carries out the valuation at the original rate of each loan, assumed to be 4\(\%\), while the investor uses a rate that must take into account the cost of capital and the specific risk of the transaction, i.e. the uncertainty inherent in the expected cash flows (also accounting for the information asymmetry). This difference is also referred to as the bid/ask spread and can be very large precisely because of the amortized cost method used by banks for the valuation of receivables in the financial statements, according to the IAS/IFRS international accounting standards.

The difference between the market values of the first and second investor on the secondary market mainly depends on the information asymmetry and on the risk inherent in a market for lemons, i.e. that the seller, knowing the portfolio well, could sell only the worst credits, namely the lemons. This factor justifies the bid/ask spread on the secondary market and therefore the difference between the valuation rates adopted by the two counterparties.

The values of the bridge analysis, which reconciles the value defined on the secondary market with the Net Book Value (NBV) determined by the bank, are shown in Fig. 5 as follows:

  • the second bar measures the difference between the market values defined by the first and the second investor; it depends on the difference in valuation rates resulting from the information asymmetry.

  • the third bar shows the difference between the first investor’s valuation and the NPV determined on the basis of the second investor’s rate. This component therefore derives mainly from the indirect costs considered in the investor’s assessment.

  • the fourth bar reconciles the NPV defined by the investor with the NBV calculated by the bank and depends on the different valuation rates used, as described above.

Fig. 5 Bridge analysis

Figure 5 shows that with each transaction the value of the portfolio is reduced, according to the discount rate applied to the valuation in the transaction.

Finally, Table 16 and Fig. 6 show the sensitivity analysis of the bridge analysis, changing the discount rate:

Table 16 Bridge analysis sensitivity
Fig. 6 Bridge analysis sensitivity

As shown by Table 16 and Fig. 6, the discount rate determines the value of the portfolio: the greater the rate, the lower the value as the number of transactions increases.

Considering the market prices in the analysis, we assume a current discount rate between 15 and 30\(\%\). It follows that:

  • the correct market price, assuming a CoC rate of 10\(\%\) and the VaR (99.5\(\%\)) as the risk measure for this risk capital, equals the market value of the first investor’s portfolio, i.e. 502 million euros;

  • the real profitability of the second investor equals the difference between the market value price of the first investor and that of the second investor, which changes with the hypothesized discount rate. Considering a range of 16–26\(\%\) (within the 15–30\(\%\) observed on the market), always higher than 13.27\(\%\), we obtain an expected present value of extra profits over the cost of capital varying from 36 million to 131 million euros, or in terms of:

    $$\begin{aligned} \frac{\text{current expected profit}}{NBV} = 5\% \end{aligned}$$
    (9)
    $$\begin{aligned} \text{RORAC} = \frac{\text{current expected profit}}{\text{RC}} = 4.3\% \text{ to } 15.7\% \end{aligned}$$
    (10)

    where RC is the risk capital, estimated at 834 million euros.
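For transparency, these RORAC bounds follow directly from the extra-profit range and the risk capital:

$$\begin{aligned} \text{RORAC}_{\min } = \frac{36}{834} \approx 4.3\%, \qquad \text{RORAC}_{\max } = \frac{131}{834} \approx 15.7\% \end{aligned}$$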

6 Conclusions

The increasing importance of NPLs in the secondary market has led the European institutions to regulate transactions, which reveal specific features in comparison with the NPLs primary market. The new regulatory framework could foster effective due diligence to support investment choices in NPLs portfolios. In the context of secured NPLs, due diligence can significantly affect business profitability, from the sale on the secondary market to the dispute resolution time of the recovery action.

In this paper, we propose a Dependent Forest algorithm based on Non-Linear Canonical Correlation (DF-NLCC), defined by a specialised splitting rule for projecting the recovery rate of a portfolio of secured NPLs. We develop this technique in order to capture the variables linked to an NPLs portfolio characterised by a complex dependency structure. In particular, considering the case of a secured NPLs portfolio, the recovery rate is a time-dependent variable, since the shorter the recovery time of the credit, the higher the recovery rate.

Once the recovery rates have been estimated by artificial intelligence algorithms, we provide a tailor-made approach by pricing the informational asymmetry risk component that usually drives a wedge between the book values and the market values of NPLs.

Indeed, asymmetric information arises from banks’ cherry-picking of assets for sale: banks may be incentivised to retain the best assets, along with the best client relationships, and the prices offered by investors have to account for the adverse selection of the assets up for sale. Further research will address the differences between secured and unsecured NPLs transactions on the secondary market.