1 Introduction

Driven by technological advancements, Peer-to-Peer (P2P) lending meets a growing demand for alternative financing and personal loans (Suryono et al., 2019). These loans are typically unsecured (De Roure et al., 2016) and lenders bear the risk of default. Most major P2P lending platforms set interest rates based on their internal rating models. On these platforms, accurate risk assessment is crucial: If the platform overestimates the risk of a loan, the borrower might be able to get a cheaper loan elsewhere. If the risk is underestimated, lenders might not be able to recover their investment in case of default and might not invest via the platform again. However, the lack of credit history and collateral makes it difficult to assess the creditworthiness of borrowers using traditional credit risk assessment methods (Bavoso, 2020).

Machine Learning (ML) is a promising technology that could prove key to more accurate risk evaluation (Zhang et al., 2015). While many studies have been conducted on the topic of credit risk assessment using ML, most focus on predicting binary default, rather than default timing. Moreover, ML is often seen as a “black box”—powerful at risk-classification, but not well suited for analyzing the economic impact of individual risk factors. Accordingly, this (perceived) trade-off between explanatory and predictive performance limits the usefulness of ML methods in risk management (Van Liebergen, 2017). However, recent developments in explainable ML demonstrate that the two are not necessarily at odds, with ML survival analysis methods identifying both known and novel risk factors for breast cancer survival (Liu et al., 2023; Moncada-Torres et al., 2021).

Our study applies this promising methodology to credit risk assessment in P2P lending. We use a dataset from the Estonian platform Bondora consisting of over 350,000 loans and 112 features with a volume of €915 million at the end of 2023. Using classical and extreme gradient-boosted Cox models, we predict the risk of P2P loans. Subsequently, we assign risk ratings using risk stratification. We show that the ratings based on the ML model (i.e., the boosted Cox model) significantly outperform both Bondora’s risk rating and the rating based on the classical model (i.e., the linear Cox model).Footnote 1 We also discuss the practical implications for risk screening, setting fairer interest rates and the potential profit opportunities for investors when using these models.

Then, we open the “black box” of ML by using Shapley Additive Explanations (SHAP) to explain and compare the models (Lundberg & Lee, 2017). Here, we examine the differences between the classical and ML Cox model and evaluate how the ML model can be used to identify risk factors in P2P lending. Our analysis uncovers risk factors that align with those identified in prior studies, while also unveiling novel relationships with nonlinear and interaction effects.

In summary, we demonstrate that with superior predictive and explanatory performance, explainable ML survival analysis is not only a useful tool for credit scoring but also for the examination of credit risk factors. The study is structured as follows. Section 2 presents the theory, reviews the existing research on risk assessment methodologies, and outlines the contribution and objectives of our study. Section 3 then introduces the datasets and our methodical approach.Footnote 2 Section 4 presents the model’s performance and explanations, which we interpret and discuss in Sect. 5, followed by Sect. 6 that concludes the study.

2 Theoretical background and literature

In the following subsections, we briefly review the literature on credit risk modeling in P2P lending and explainable ML. Furthermore, we identify the research gap and discuss the objectives and contribution of this study.

2.1 Credit risk modeling in P2P lending

As outlined in the introduction, accurate risk assessment is essential for platforms, borrowers, and lenders. However, a review by Suryono et al. (2019) underscores that risk assessment poses a major challenge in P2P lending due to large information asymmetry, lack of credit history, gender discrimination, and low loan success rates. The review further identifies ML methods and big data as potential solutions to these issues.

Investigating credit risk is a well-established field of research and has been studied extensively using binary classification. Binary classification aims to predict whether a borrower will default. It has been studied widely on P2P lending datasets using statistical (e.g., Emekter et al., 2015; Serrano-Cinca et al., 2015), ML (e.g., Jiang et al., 2018; Xu et al., 2021; Zhou et al., 2019), and rarely using explainable ML methods (e.g., Ariza-Garzón et al., 2020; Bussmann et al., 2021). In practice, like most studies, banks typically use binary classification to calculate the probability of default for credit scoring (Dömötör et al., 2023). While their exact method is not disclosed, Bondora’s ratings are based on expected loss, which also takes into account the likelihood of recovery after default (Bondora, 2023c).Footnote 3

Survival analysis offers an alternative approach that has some advantages over traditional classification methods: First, survival analysis takes into account the time duration until a loan defaults. This is not done when analyzing the default status alone. However, time to default plays a crucial role in return on investment and expected loss: With later defaults, the exposure at default and thus investment at risk is lower. This is particularly important in the case of fixed-rate loans, where the outstanding payments for investors are much higher when the loan defaults on the first payment as opposed to the last payment. Second, survival analysis also enables including loans in the training dataset that have not yet reached maturity through censoring. This can be a significant advantage, especially for long maturities, as it allows for the use of more recent data. A third advantage is that researchers can use survival functions to examine the influence of characteristics on solvency over time, which may provide insights into possible underlying causes. This is promising in practice, e.g., when evaluating loans in secondary markets, borrower characteristics may have a different effect after a certain period of time.

While statistical survival analysis models like linear Cox regression are used in some studies on P2P lending datasets (e.g., Emekter et al., 2015; Serrano-Cinca et al., 2015), few ML-based survival analysis studies exist (Suárez-Ramírez et al., 2022; Tan et al., 2019) and none of these use explainable ML methods. Nevertheless, when proposing a novel ML survival-analysis method (Bai et al., 2022) demonstrate that ML-based survival analysis can outperform statistical survival analysis methods in default classification.

2.2 Explainable machine learning and contribution

The scarcity of ML survival analysis studies in recent literature may be caused by seemingly conflicting goals of their methodologies: While popular survival analysis methods like linear Cox regression are commonly used for their interpretability in risk factor assessment (Emekter et al., 2015; Reichenbach & Walther, 2021; Serrano-Cinca et al., 2015), ML methods focus on predictive accuracy and are more difficult to interpret. Even the most precise predictions from a ML Cox model may not be very useful if they are not interpretable.

Recent research addresses this issue with the model agnostic and scalable explainable ML method SHAP (Lundberg and Lee, 2017; Lundberg et al., 2019, 2020; Mitchell et al., 2022). This method enables the explainability of ML Cox models, unveiling much more complex nonlinear relationships and interaction effects than the linear models could capture. This combination of methods yielded breakthrough results in clinical research, identifying both clinically confirmed and potentially novel risk indicators for breast cancer survival (Liu et al., 2023; Moncada-Torres et al., 2021) using explainable ML survival analysis.

To the best of our knowledge, explainable ML survival analysis has not yet been applied to P2P lending. This study aims to address this research gap by presenting explainable ML-based survival analysis as a useful tool for both credit scoring (i.e., classification using risk ratings) and the (inferential) analysis of different risk factors (e.g., debt-to-income, age, education) in P2P lending. Thus, we seek to contribute to both the literature on credit scoring techniques and the analysis of risk factors of individual borrowers.

Concerning the Bondora dataset used to test the methods, previous studies found room for improvement in credit risk scoring (e.g., Dömötör et al., 2023; Lyócsa et al., 2022; Teply & Polena, 2020), and signs for asset mispricing on the secondary market (Caglayan et al., 2020). However, as discussed above, the data has not been extensively studied using interpretable ML and survival analysis methods.

3 Data and methods

In this section, we briefly present the data and methods used in this study. For readers not familiar with survival analysis, ML, or SHAP, we recommend reading our more detailed introduction to these methods in Appendix B, where we introduce classical (Cox) models, their predictions and how these can be generalized using ML methods. We also explain how SHAP values are used to explain ML predictions.

The methodology of our study consists of 6 major steps: (1) preprocessing, (2) sampling, (3) training of the models, (4) rating assignment, (5) performance measurement and (6) analysis of risk factors. These are illustrated by Fig. 1 and are discussed in the following subsections after the presentation of the datasets.

Fig. 1
figure 1

Overview of the main preprocessing, training and evaluation steps

3.1 Datasets

We use two datasets provided by Bondora (2023a) for our analysis. These are updated daily and were last retrieved on January 3, 2024. The loan dataset contains data on all loans originated on the Bondora platform, including over 350,000 loans from Estonia, Finland, and Spain. Its 112 features provide details on the borrowers’ demographics, financials, and borrowing history, as well as risk ratings calculated by Bondora and information on the loan terms and outcome (see Appendix A for a full table with Bondora’s definitions). With loans ranging from 2009 until the end of 2023, the total cash volume of all loans sums up to 915 million Euros. The loans in the dataset have varying maturities, the most common being 5 and 3 years. Our study focuses on 3-year-loans as the duration is short enough, in light of the limited data timespan available, to allow for the splitting of training, validation, and test data in a temporal order. The fixed loan duration ensures comparability between borrowers when analyzing the risk factors using survival analysis (see Appendix B.1). The following data and plots address this subset.

Fig. 2
figure 2

Number of 3-year-loans by origination year and risk rating

As seen in Fig. 2 the majority of loans were originated after 2017. The overall risk structure of the portfolio appears to have shifted towards less risky loans after 2019. This may be explained by Bondora’s shift to hands-offFootnote 4 investment from 2016 to 2020 (Bondora, 2016). The lower amount of loans since 2020 may be due to the COVID-19 pandemic and a focus on higher-quality lenders.

In addition, we use Bondora’s repayments dataset (Bondora, 2023a), which contains all payments received by investors (over 6.2 million in total) to calculate the loans’ internal rate of return (IRR). In contrast to the return on investment, the IRR takes into account the time of payment and therefore indicates the actual rate of return realized by the investors. The results of this analysis are presented in Sect. 4.1.

3.2 Preprocessing

As seen in Fig. 1, we preprocessed the data for model training in 10 steps. To achieve better comparability, we train both models on the same datasets. As such, these steps follow the stricter requirements of the linear model by excluding missing values. Benefiting the interpretability of both models, we reduce high dimensionality, co-dependencies and sparsity of the data, leading to simpler models and thus simpler SHAP explanations.

Next, we present the preprocessing steps in more detail: First, we removed 37 features not available at the time of the auction to avoid target leakage.Footnote 5 This includes data about the loan status, secondary market, and debt collection process. To identify these features, we consulted the Bondora auction Application Programming Interface (API) documentation (Bondora, 2023b) and the Bondora website (Bondora, 2023a). We then removed another 16 features not available due to data protection laws after June 1, 2017 (e.g., private information like marital status and employment position, but also financials like debt-to-income (DTI)). Furthermore to avoid overfitting, we dropped features not relevant to the analysis, like the loan ID, loan number, username, and features about the exact timing of the listing like the payment day, listing time, weekday, and month.

As a fourth step, we modeled some features found relevant in prior literature that are not present in the data and added control variables: DebtToIncomeModeled measures the ratio between monthly income and monthly liabilities plus loan payments.Footnote 6RepaymentRatio measures the previous repayment amounts divided by the previous loan amounts.Footnote 7

Next, we removed features with a high percentage of missing values: DateOfBirth, City, and County were missing for all borrowers, likely because they were retracted from the public data. Additionally, we removed the country-specific credit scores from external rating agencies due to missing data (between 41 and 91%). This left us with a few features with less than 2% missing. For these, we removed data points with any missing values. This affected 2.85% of the remaining borrowers. For categorical features, we removed unused and extremely rare categories (less than 0.1% of the borrowers), i.e., the only remaining homeless and 41 Slovakian borrowers. Note that this step may introduce bias to the benefit of improved comparability between the models and could be skipped for the boosted model.

Then, we merged the two categories “income unverified” and “income unverified, cross-referenced by phone” due to a very low number of loans in the latter category (less than 1%). As a significant portion of Estonian borrowers is Russian-speaking, this population was separated from the other Estonian borrowers as “EE_Ru”. For other country codes, the language spoken was more homogenous, and other languages were too rare to draw any conclusions. After that, the features “Language” and “Country” were removed from the dataset and replaced by the new feature “Country_Lang”.

We renamed categorical features encoded as numbers (as seen in Appendix A) to strings to improve readability in the plots and remove false ordinality. Additionally, we renamed NewCreditCustomer to NewBondoraCustomer, as this reflects the meaning of the feature better, and corrected spelling mistakes. Furthermore, we use one-hot encoding for the categorical variables when fitting the linear models. One-hot encoding creates a new column for every categorical value (e.g., Rating_AA, Rating_A\(, \ldots ,\) Rating_HR). The new columns contain a 1 if the loan has the corresponding rating and a 0 otherwise. To avoid multicollinearity, we dropped the first category of each feature in these models.

Finally, we added two variables needed for survival analysis. That is the survival time (SurvivalTime) and an event indicator (Defaulted). In our analysis of the Bondora dataset, we define survival time and the default indicator as follows. Survival time is the time in days between the loan origination and the first of the following events: default (defaulted), end of the loan term (not defaulted), and end of the observation period (not defaulted), i.e., the split date for training and validation sets or the date of the report for the testing set. In other words, we investigate how long a subject can observably meet the loan’s conditions without default. Crucially, in the case of early repayment, the loan is marked as censored at the complete loan duration of 36 months as well.Footnote 8

3.3 Sampling

Then, we partitioned the data into training, validation, and test sets using temporal sampling. Temporal sampling ensures that the test data chronologically follows the training data. Both training and validation data are censored at the split date. This implies that, regardless of subsequent knowledge about a loan’s default post-split, such information is disregarded since it wasn’t available at the time of the split. This methodology mirrors a more authentic scenario, reflecting the real-world decision-making context where investors and platforms can only consider information from previously issued loans for risk assessment. Hence, we opted for temporal over random sampling.

The split date of January 1, 2020 was set based on the distribution of loans shown in Fig. 2, aiming to leave sufficient loans for training and enough completed loans in the test set in order to calculate the IRR correctly. The training set contains 90% of the loans between 2014 and the split date, while the validation set contains the remaining 10%.Footnote 9 The test set consists of all loans originated in 2020.

We trained the models on the training set, optimized the hyperparameters on the validation set, and evaluated them on the test set. We did not use the test set for any other purpose than evaluation. In contrast, we calculated the SHAP values on the training set to reveal potential overfitting.Footnote 10

3.4 Training

We used the python package lifelines (Davidson-Pilon, 2023) to train the linear Cox models. For the boosted Cox models, we used the GPU-accelerated XGBoost package (Chen & Guestrin, 2016). Refer to Appendix B for an in-depth introduction to Extreme Gradient Boosting (XGBoost) survival analysis.

We tuned the hyperparameters of the XGBoost model on the validation set using a Tree-structured Parzen Estimator hyper-parameter optimizer (Bergstra et al., 2011) implemented in the python package Optuna (Akiba et al., 2019).

3.5 Rating assignment

We assigned risk ratings to the loans to evaluate the models’ predictive performance and compare them to Bondora’s rating. For these ratings we used the predicted Hazard Ratios (HR) of the linear and boosted Cox models. To achieve similar sized rating groups, we determined risk intervals that partition the training set predictions into equally-sized groups. Every interval then represents a rating. This approach is similar to the one demonstrated by (Bai et al., 2022), however, they use the test set risk predictions to define the risk intervals. As we use temporal sampling to simulate a realistic application scenario, we need to base the intervals on the training dataset.

3.6 Model and rating performance measures

We measure both direct model performance and the more indirect performance of the model-derived ratings. While including both, we focus on the latter in results Sect. 4.1, as argued below.

As a direct performance measurement of the survival models, we use the concordance index (c-index) (Harrell et al., 1982; Uno et al., 2011). It measures the rank-correlation of the predicted survival times with the observed survival times, or in other words, the probability that a randomly selected pair of loans is ranked correctly by their survival time. However, the c-index is likely to be less relevant to investors and platforms, for whom the distinction between early and late defaults is more important than the exact order of survival times.Footnote 11

We therefore focus on the rating’s ability to distinguish between good and bad loans. Hence, the linear and boosted ratings (derived from the linear and boosted models, respectively) and the Bondora rating are evaluated on the test set by comparing the default rates of the resulting risk groups. We also calculate the average IRR for each risk rating to assess whether using the models to select loans would be profitable compared to using Bondora’s rating. The IRR is defined as the interest rate at which the present value of future cash flows equals the amount of the original investment (Dudley, 1972). Thus, as the name suggests, it can be interpreted as the investor’s annual rate of return (assuming the cash flows can be reinvested at the same rate). We use the xirr function of the “tvm” package to calculate the annual effective return of all completed loans (Truppia, 2023).

Additionally, we estimate the survival functions of the risk groups using the Kaplan–Meier estimator from the package Lifelines (Davidson-Pilon, 2023). The Kaplan–Meier estimator is a non-parametric estimator of the survival function in the face of censoring (Kaplan & Meier, 1958). The resulting survival function is a step function that decreases with every loan default and expresses the empirical probability of survival beyond each time step.

3.7 Analysis of risk factors using SHAP

Finally, we are interested in the practicability of using the models for an analysis of individual risk factors in P2P lending. For the linear Cox model, this is straightforward, as the coefficients of the Cox model directly indicate whether a feature increases (coefficient \(>0)\) or decreases \((<0)\) the default risk. The trained boosted models, however, are up to 10 layers deep and consist of up to 2594 trees. This makes them very difficult to interpret. However, the XGBoost package includes a method to explain the predictions of the model using SHAP values, which are explained in more detail in Appendix B.

Thus, to be able to compare the predictions of the linear and the boosted Cox models, we calculated the SHAP values for both models. For the linear model, SHAP values equal the product of it’s coefficients with the centered values of the variables (see Appendix B for a theoretical explanation). For the boosted model, we used the SHAP implementation in the XGBoost python package to estimate the values, including the SHAP interaction effects as developed by Lundberg et al. (2019). The resulting model explanations can be found in Sect. 4.2. In total, the final GPU-leveraged training of each model and calculating explanations took around four minutes.Footnote 12

4 Results

The results are divided into three subsections. First, we present the performance results of our two models and the ratings (derived from the model predictions). Second, we examine the most influential features (risk factors) and display the differences between the two models using SHAP feature dependence plots. Third, we present specific feature interaction effects.

4.1 Model and rating performance

In both Harrell’s c-index and the more conservative IPCW c-index, the ML models performed slightly better than the linear models (0.674 vs 0.659).Footnote 13

The loans in the test set (i.e., loans that originated in 2020) were rated using hazard ratio thresholds calculated on the training predictions (i.e., loans that originated between 2014 and 2019), as described in the methods and data sections. We use the same number of ratings as Bondora in the testing set for comparability (i.e., “AA” to “F”). Bondora rated no loans in the “High Risk” category since 2020.

The following plots show the Kaplan–Meier estimates of the survival probability over time of the risk groups for the different models. The shades around the functions are the 99.9% confidence bands of the empirical survival function estimates (Figs. 3, 4 and 5).

Fig. 3
figure 3

Kaplan–Meier survival probability estimate of the test set for ratings based on Bondora’s ratings. The step function shows the probability of survival of a loan in its respective risk group from the beginning of the loan term to a given point in time. The shaded areas surrounding the functions represent the 99.9% confidence intervals of the Kaplan–Meier survival function estimates

Fig. 4
figure 4

Kaplan–Meier survival probability estimate of the test set for ratings based on the linear Cox model. The step function shows the probability of survival of a loan in its respective risk group from the beginning of the loan term to a given point in time. The shaded areas surrounding the functions represent the 99.9% confidence intervals of the Kaplan–Meier survival function estimates

Fig. 5
figure 5

Kaplan–Meier survival probability estimate of the test set for ratings based on the boosted Cox model. The step function shows the probability of survival of a loan in its respective risk group from the beginning of the loan term to a given point in time. The shaded areas surrounding the functions represent the 99.9% confidence intervals of the Kaplan–Meier survival function estimates

The ratings based on the boosted model (Fig. 5) have the smallest confidence intervals and largest margins between risk categories, followed by the linear ratings (Fig. 4) and the Bondora ratings (Fig. 3). In the case of the Bondora ratings, multiple survival estimates cross, with the “A” group having a lower survival probability than the “B” and “C” group at the end of the distribution. For the ratings based on the linear Cox model, loans in group “D” end up having a higher probability of survival than loans in group “C”. For the boosted ratings, on the other hand, the groups are approximately proportional to each other and the survival curves show considerable margins between them after 36 months. These time-to-default estimates are mathematically tied with the default rates (the mortality rate at the end of the observation period equals the default rate) and thus IRR, which we present in the table below.

Table 1 Mean values grouped by Bondora and boosted Cox ratings

Table 1 shows the size of each group, the mean of the applied amount, IRR,Footnote 14 interest, and default rate for the Bondora and boosted Cox ratings. Despite the more than four times larger group size of the “AA” group of the boosted ratings, it has a lower default rate than Bondora’s (14.45% vs 17.26%) and significantly higher IRR (15.63% vs − 3.20%). Similarly, the ML model rated twice as many as F, at a higher loss (− 40.69% vs − 36.26%), again indicating better differentiation. In addition, the interest rates are not as strongly correlated with the boosted Cox ratings as they are with the Bondora ratings. For example, the interest rate in the top group of the boosted model is almost three times higher (27.77%) than the top group of the Bondora ratings (9.52%). Overall, the boosted Cox ratings have a more ordinal relationship with default rates and IRR. This is in line with the Kaplan–Meier estimates above (with B having a lower default rate (20.26%) than A (32.14%) for Bondora). Additionally, the boosted model has a significantly wider spread for the IRR between the rating groups (Boosted model: 15.63% to − 40.69% vs Bondora: 3.91% to − 36.26%).

4.2 Model explanations: SHAP feature attributions

This section presents the SHAP explanations of the Cox models using scatter plots for the most influential features. In these plots, each point consists of a borrowers’ feature value in the training set (x-axis), paired with the SHAP estimate of how that value impacted the models prediction (y-axis). E.g., for age, a point (52, − 0.2) would indicate that SHAP estimated a 0.2 reduced predicted logarithmic HR based on that borrowers age of 52.

Note that these estimates are additive, meaning that adding all SHAP estimates of one borrower across all features yields that borrowers predicted logarithmic HR. As the models predict the logarithmic HR of the borrowers, these plots allow us to interpret the estimated economic impact of a borrowers feature value directly: In non-logarithmic terms, this reduction by 0.2 equals multiplying the borrowers default hazard by \(82\%(\approx e^{-0.2}).\) For detailed explanations of Cox models and SHAP explanations, refer to Appendices B.2 and B.5.

We color-coded the linear SHAP values by the statistical significance of their corresponding Cox coefficient (full table in Appendix E),Footnote 15 as indicated by the color bars on the right. The vertical dispersion of the boosted SHAP values is explained by feature interaction effects. We will dissect these for selected features later.

Figures 6 and 7 show the SHAP values for the boosted and linear Cox model for three selected features from the demographics category, Country_Lang, Education Age. The full dependence plots for all 21 features in the categories demographics, financials, and borrowing history are presented in Appendix F.

Fig. 6
figure 6

SHAP values for the features Country_Lang (left) and Education (right). The blue dots represent the predictions of the boosted model, while the lines are the coefficients of the linear model, with their statistical significance indicated by the color bar on the right (colour figure online)

Country. The most impactful demographic feature is the country and language of the borrower. For this feature, the Shapley values of the linear and boosted Cox model appear to be quite similar in magnitude and direction, except for the centering difference. While the boosted model is centered on the average contribution of all feature values, the linear model is centered on the first feature value dropped in one hot encoding (i.e., EE). This results in slightly higher values for the linear model. The linear model’s SHAP values are equal to the coefficients of the linear Cox regression (see Appendix E).

Both models show that Estonian-speaking Estonians have the lowest risk, followed by Russian-speaking Estonians and Finnish borrowers. Spanish borrowers exhibit the highest default risk \((\text{HR} * e^{1.2} \approx 3.22).\)

Education. One of the more intuitive features in the demographics category is the educational level of the borrower. While the boosted model finds a monotonically decreasing risk with higher education, the linear model only finds a significant impact on risk for the highest educational level \((\text{HR} * e^{-0.2} \approx 0.82).\)

Fig. 7
figure 7

SHAP values for the feature Age. The blue dots represent the predictions of the boosted model, while the slope of the line is determined by the coefficient of the linear model. Its statistical significance is indicated by the color bar on the right (colour figure online)

Age. The linear model identified age as a significant risk factor to the \(p=10^{-2}\) level (blue line), and found that the risk decreased monotonically with age. For example, a borrower aged 60 \((\phi _{\rm age} \approx -0.04)\) would have around 92% the risk of a borrower aged 20 \((\phi _{\rm age} \approx 0.04)\) with \(\text{HR} = e^{-0.04}/e^{0.04} \approx 0.92.\) By definition, the slope of the line corresponds to the coefficient of the linear Cox regression (− 0.002) and has its x-intercept at approximately 40 years, which is the average age of borrowers. In the boosted model the risk of default also generally decreases with age. However, young borrowers between 20 and 25 and borrowers older than 60 have a higher risk than the linear model would predict.

While these SHAP dependence plots visualize the individual feature’s contributions, some features may interact with other features in the model, leading to vertical dispersion in the plots (e.g., for Country_Lang and Education). These can be investigated further using SHAP interaction values, as shown in the following section.

4.3 Model explanations: SHAP feature interactions

The SHAP interaction effects estimate how much the SHAP value of a feature changes when combined with another feature. Subtracting all interaction effects from the total effects results in the more focused main effect of a feature. This main effect is often more interpretable than the total effect, as it is not influenced by interaction effects and thus shows less vertical dispersion.

In the plots below, we present the main effect and the two most important interaction effects (which are our two examples from the last section, Country_Lang and Age) for the categorical feature with the most vertical dispersion, HomeOwnershipType. For the interaction effects, the dots in the scatter plots are colored by the value of the interacting variable. Here, red indicates a higher and blue indicates a lower value. The coloring is illustrated with the color bars on the right.

Fig. 8
figure 8

SHAP values for the main effect of HomeOwnershipType. The main effect is calculated by subtracting all interaction effects from the total effect (for HomeOwnershipType, see Appendix F)

Compared to the total SHAP value (see Appendix F), the main effect for the homeownership type in Fig. 8 shows less vertical dispersion.

Looking at the interaction with the country (and language for Estonians), as seen in Fig. 9, borrowers living with their parents or in furnished apartments are deemed less risky in Finland and Spain, while the opposite is true for Estonians. The inverse relationship is found for the other categories, except for owners. Here, Spanish borrowers are deemed more risky, followed by Estonians and Finns.

Fig. 9
figure 9

SHAP values for the interaction effect of HomeOwnershipType and Country_Lang. The dots are colored by the value of Country_Lang (color bar on the right) (colour figure online)

Fig. 10
figure 10

SHAP values for the interaction effect of HomeOwnershipType and Age. The dots are colored by the value of Age (color bar on the right) (colour figure online)

Figure 10 shows the interaction with age. The risk decreases for older borrowers most notably in the categories Mortgage and Owner, while increasing for LivingWithParents and TenantFurnished.

In summary, these interaction effects reveal more interpretable feature contributions on credit risk for less clear effects such as HomeOwnershipType. These interaction effects are derived from the boosted model without any manual modeling.

5 Discussion

The principal goal of this study was to evaluate the utility and performance of explainable ML survival analysis within the P2P lending sector, with a particular focus on its potential to improve risk assessment and risk indicator analysis compared to linear models.

5.1 Model performance

Our results show that the ML model outperformed the linear model. This applies to both Harrell’s c-index and the more conservative IPCW c-index. While the differences appear slight for the c-indices, the following results on practical rating performance demonstrate that the boosted model performed significantly better at risk assessment. For an in-depth analysis of the c-index results, including their limited significance in practice and further robustness checks, see Appendix C.

The boosted model achieved the best risk differentiation across the rating groups and vastly outperformed Bondora’s rating. The Kaplan–Meier survival estimates revealed that the boosted model was the only model with non-crossing survival curves ordered correctly by risk ranking (Fig. 5). Here, it had the largest margins between groups, the smallest confidence bands and also the largest span between the highest and lowest risk groups, indicating a superior ability to discriminate loans by risk. When looking at the tabular results including default- and IRR rate, compared to Bondora, the boosted model showed a more ordinal, even distribution and wider spread of the IRR between the rating groups from 15.63 to − 40.69% for the boosted-, and 3.91 to − 36.26% for Bondora’s ratings.

Investing an equal amount in boosted “AA” rated loans would have outperformed the Bondora-rated “AA” loans by almost 19% p.a. (IRR of 15.63% compared to − 3.20%). Moreover, the boosted “AA” rated loans were less risky, at a default rate of 14.45%, in contrast to 17.26% for the Bondora “AA” group. These results were especially striking when considering the risk-group sizes. The boosted model assigned more than four times as many loans as Bondora to the “AA” rating (713 vs 168). This favorable risk-return profile could be further enhanced by targeting mispriced loans (predicted low risk and high interest rates). Future research could investigate the model’s potential for identifying mispriced loans on the secondary market.

On the opposite spectrum, ML survival analysis ratings could aid in the screening of high-risk borrowers, addressing a major challenge for P2P platforms. The boosted model rated almost twice as many borrowers as F, returning − 40.69% annually compared to Bondora’s F group at − 36.26%. Screening out these high-risk borrowers would benefit investors, platforms and lower-risk borrowers alike, as argued in the theory section.

In addition, boosted survival analysis can enable fairer interest rates on P2P platforms. Multiple crossing survival curves and poor predictiveness of time-to-default indicated that the Bondora ratings may not be well-calibrated. This could cause unfair interest rates, as detailed in Sect. 2. If the interest rates on loans were fairly pricing risk, they should compensate for it. However, we found a different picture, e.g., the average interest rate of the boosted “AA” group (27.77%) was almost three times that of Bondora’s “AA”-group (9.52%), despite the boosted-“AA” group’s lower default rate. Moreover, we observed a low correlation between default risk and interest rates. The rating-performance differences may be attributed to the fact that Bondora’s ratings encode expected loss rather than expected time-to-default. However, expected loss is linked to expected time-to-default, as argued in the theory section (Sect. 2). Furthermore, the boosted ratings outperformed Bondora’s ratings regardless of the used performance measures (i.e., c-index, differentiation based on Kaplan–Meier curves, mean IRR and mean default rate). Hence, the interest rates may be set inaccurately and borrowers should get quotes from multiple platforms to promote fairer pricing.

While the boosted model outperformed the linear model, the linear model exhibited decent risk-rating performance, except for “C” having a significantly higher default rate (49.17%) than D (39.58%, Appendix D). We will address the model differences in the next section.

In summary, the application of boosted survival analysis to rate loans originated in 2020, utilizing historical performance data available to an investor by the end of 2019, demonstrated effective results. Our findings suggest that, within our dataset, boosted ML survival analysis models emerged as a promising tool for improving credit risk assessment, supporting investors’ decision-making, advancing risk screening and promoting fairer interest rates.

5.2 Model explanations

Another objective of this study was to investigate whether explainable ML survival analysis can be beneficial to risk indicator analysis in P2P lending. With SHAP explanation values, we explained and compared both linear and boosted Cox models using feature dependence plots (Sect. 4.2).

Overall, the boosted model extracted at least as much information from the data as the linear model. In our data, where the linear model identified significant effects, the boosted model consistently found similar effects (see Appendix F for a full display of all feature contributions). Unsurprisingly, these similar findings are mostly in line with prior research. To name a few: Borrowing risk reduced with age (e.g., Albert & Duffy, 2012; Kurnianingsih et al., 2015), male gender indicated higher risk (e.g., Lin et al., 2017), while having a mortgage reduced risk (e.g., Serrano-Cinca et al., 2015).

Furthermore, the boosted model uncovered more detailed and new relationships between borrower information and default risk. For example, all grades of education were well differentiated by the boosted model, with higher education decreasing risk. This is in line with prior research (e.g., Chen et al., 2018; Lin et al., 2017). In contrast, the linear model only found a significant impact on risk for the highest educational level, and no significance for the intermediate levels. The same applied for homeownershiptype (Fig. 12b), where the boosted model clearly distinguished LivingWithParents (higher risk), and Owner (lower risk) in addition to the categories identified significant by the linear model (Mortgage, lower risk and TenantFurnished, higher risk). While the higher risk for borrowers living with their parents makes intuitive sense (indicates lower assets), the lower risk for homeowners is supported by prior research (e.g., Serrano-Cinca et al., 2015). Furthermore, some variables impacted the boosted model where the linear model did not find significance at all. The relationship with DTI (Fig. 13b) found by the boosted model is a known factor in risk prediction (Emekter et al., 2015; Serrano-Cinca et al., 2015) and was not identified by the linear model.

In addition to the more detailed linear effects, the boosted model found both known and novel non-linear and interaction effects exploratively. For example, credit risk decreased sharply for younger borrowers with age, flattening out until the age of 55 years and then increasing slightly again. As mentioned, research confirms this general trend of the models, as risk aversion increases with higher age (e.g., Albert & Duffy, 2012; Kurnianingsih et al., 2015). Additionally, evidence points towards a nonlinear and domain-specific relationship between risk-taking and age, with increased risk-taking in early adulthood (e.g., Rolison et al., 2014; Willoughby et al., 2021), which was also picked up by the boosted model. On top, the newly quantified risk increase for borrowers at 55 years may be explained by lower income due to retirement, increased medical costs and reduced life expectancy. Analyzing interaction effects also unveiled novel risk indicators and helped us dissect the vertical dispersion of primary effects. For HomeOwnershipType, we found that tenants living with their parents or in a furnished apartment were deemed less risky when younger (Fig. 10), which intuitively makes sense. The interaction with the country (Fig. 9) indicates that the model may be able to account for systemic and cultural differences. To the best of our knowledge, these particular interaction effects have not been identified exploratively in prior research. Especially in the European market, analyzing cultural and systemic differences (e.g., due to different retirement systems) might be valuable for credit risk assessment.

Finally, one advantage of the classical model is that it can test for statistical significance. However, in the cases where the ML-SHAP values were less dense, the linear models often found the feature value to be less or even insignificant (white coloring). This indicates that SHAP feature importances (average absolute SHAP values) may be consistent with statistical significance, which was also found by Bussmann et al. (2021).

Overall, the boosted model not only found the same significant risk factors as the linear model but also found more detailed and even new relationships between risk factors and default risk. This included nonlinear and interaction effects that are not quantifiable in linear models in the same explorative way.Footnote 16 This is in line with prior research on this methodology in oncology (Moncada-Torres et al., 2021), supporting the argument that explainable machine learning survival analysis can reveal known and potentially novel risk factors in risk research.

5.3 Limitations and implications

Further research is required to test these methods more widely. While we were able to achieve exceptional results on the Bondora dataset, future studies should test the reliability of the models by applying them to different datasets and split dates. For the latter, we suggest a rolling window approach, as this would allow the adaptability and validity of the models to be tested. Here, the US Lending Club dataset could be suitable. The lending club dataset is widely used in prior research and has a much larger number of completed loans ranging over a longer period. The Bondora dataset did not allow for this at the time of writing due to the limited available observation period combined with the need for completed loans to calculate the IRR. However, it can be revisited in the future, as the dataset grows and more loans are completed. For our purposes, to assess risk rating ability and risk-factor analysis the dataset we used is well-suited. Especially for the latter, as seen in the explanations, the dataset provided valuable hypotheses into cultural differences important for risk assessment in the European P2P market (this insight would not be possible on the US Lending Club dataset). Furthermore, the good performance of the ML models on the test set despite the changed risk profile (changes in rating distribution in Fig. 2) indicated that the models are likely robust and can be used for future loans. As discussed in more detail in Appendix C, we also applied random sampling and rated 5-year loans, which resulted in even higher performance for the ML models and qualitatively similar results for the SHAP values. Additionally, simply removing borrowers with missing and sparse values from the dataset can introduce bias to the model. This is especially problematic if the missingness is systematic and should be investigated to avoid bias against demographics (e.g., in our case, the excluded homeless and Slovakian borrower). While our decision to compare the models lead to following the more restricted preprocessing steps of the linear model, the boosted model can be trained (and predict) on incomplete and sparse datasets in future use.

Furthermore, not all of the explained relationships, as shown in Appendix F, were straightforward. Some required further investigation (e.g., using interaction effects), and some contradicted intuition. For example, we found a risk decrease with larger liabilities and an increase with larger income. We suspect that this is caused by collinearities, e.g., with DTI.Footnote 17 This indicates that explainable ML findings need to be validated, as Shapley values explain the model, and not the data directly. If the model finds complex mathematical relationships that obscure the true underlying risk factors, explanations may be incapable of identifying meaningful risk indicators that are intuitive, straightforward, and supported by theory. Developing a model with useful explanations may require some trial and error—especially through regularization and adequate data prepossessing, including feature selection and modeling of variables likely relevant in reality. Nevertheless, spurious relations are not only found in ML models but also in linear models. To overcome this issue, a hypothesis-based approach is commonly used with linear models. For explainable ML methods matching quantitative with qualitative theory is equally important, although its explorative approach allows deriving hypotheses directly, which should be verified with existing theory.

6 Conclusion

In this study, we used classical and ML survival analysis to predict default risk in P2P lending using a European dataset with over 350,000 loans. We compared the performance of the models in rank correlation, classification, and credit risk rating. We then opened the ML models’ black box using SHAP to explain the performance differences and identify credit risk indicators.

Our results demonstrated that ML survival analysis performs exceptionally well in credit scoring, significantly outperforming the platform’s risk ratings. For investors, our ratings revealed a profit opportunity through targeting high-interest loans with low estimated risk. On the platform side, more accurate risk assessments could promote fairer pricing and improve the screening process, ultimately reducing overall risk and increasing portfolio performance.

Using SHAP, we were able to explain the models decision making and discover both novel and known credit risk factors. This yielded compelling and intuitive hypotheses, establishing promising avenues for future research.

Altogether, the methodology’s exceptional performance results, combined with its meaningful explanations, confirmed its ability to improve credit-risk ratings through more accurate while transparent credit risk assessments. With analogous findings in oncology that validate explainable ML survival analysis’ ability to generate knowledge, we are confident that this approach can further the understanding of time-to-event data across various domains. This methodology could spark a new wave of survival analysis research, including the reexamination of studies previously conducted with linear survival analysis methods.