1 Introduction

Advances in machine learning have heavily affected industry and academia in the past decades, ultimately transforming people’s daily lives. Artificial Intelligence (AI) has been applied to almost every human activity, including pattern recognition, image classification, business, agriculture, transportation, and finance. This paper focuses on machine learning applied to finance and credit risk estimation. Modern financial systems rely on credit and trust. Credit risk is a fundamental parameter that measures and predicts the default probabilities of a debtor. The correct estimation of credit risk is paramount for the entire system. Failures in credit risk estimation can lead to systemic failures such as the sub-prime crisis of 2008. Consequently, lenders devote large amounts of resources to predicting the creditworthiness of consumers and companies in order to develop appropriate lending strategies that minimize their risks. Historically, credit risk approaches have used statistical methods such as Linear Discriminant Analysis [1] and Logistic Regression [2]. These methods, however, do not easily handle large datasets.

Advances in computing power and the availability of large credit datasets paved the way for AI-driven credit risk estimation algorithms based on traditional machine learning and deep learning. Conventional machine learning techniques, e.g., k-Nearest Neighbor [3], Random Forest [4] and Support Vector Machines [5], are more effective and flexible than statistical methods. In particular, deep learning techniques [6], a vital branch of machine learning, outperform their predecessors in both accuracy and efficiency when applied to large credit risk data lakes.

This paper presents a systematic review of credit risk estimation algorithms. It critically analyzes both the major statistical approaches and AI-based techniques. The aim is to provide a comprehensive overview of the current leading credit risk estimation technology, providing justification and connections between past and present works. This work proposes a novel taxonomy combining finance with machine learning techniques. In addition, this work ranks the techniques by their performance in terms of accuracy and costs. This paper also discusses the challenges and possible solutions with respect to four aspects: data imbalance, dataset inconsistency, model transparency, and inadequate utilization of deep learning methods.

The remainder of the paper is organized as follows. Section 2 discusses the survey methodology. Section 3 introduces the principles of statistical learning, machine learning and deep learning. Section 4 analyzes credit risk-related applications in detail. Section 5 discusses the presented algorithms and ranks them by their performance on public datasets. Finally, Sect. 6 summarizes the results and current challenges, and Sect. 7 concludes this work.

2 Survey methodology

2.1 Methodology

We applied the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) reviewing methodology (Fig. 1) in our paper. First, we searched five platforms: Google Scholar, ACM, IEEE Xplore, SpringerLink, and ScienceDirect. We used the keywords “machine learning” or “deep learning” combined with “credit risk”, which returned 2400 articles in total. Then, we applied a filtering procedure that considers the trade-off between publication year and citations. After removing 1400 duplicate records, 800 ineligible records, and 76 incomplete articles, we obtained 124 screened records. Based on relevance, we excluded 24 articles that were less related to the topic. After manually checking whether each paper has clear evaluation metrics, we excluded another 24 papers. Finally, we kept 76 studies as the source for our review, selected in terms of relevance to the research topic, precision of evaluation metrics, publication time, and number of citations.
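
As a quick consistency check, the screening counts reported above add up as follows:

$$\begin{aligned} 2400-1400-800-76&=124,\\ 124-24-24&=76. \end{aligned}$$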

Figure 1 depicts the PRISMA flow diagram.

Fig. 1

The PRISMA flow diagram. From: Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. doi:https://doi.org/10.1136/bmj.n71

2.2 Inclusion and exclusion criteria

In this paper, we apply three inclusion criteria: (1) the relevance of the research topic, (2) the precision of the evaluation metrics, and (3) the publication year and citations. Papers are excluded if they are duplicated, incomplete, too old, weakly related to the topic, lacking clear metrics, or comparatively low in citations.

We show the whole workflow of the selection process in Fig. 2.

Fig. 2

The workflow of selecting papers

2.3 The datasets and approaches of the reviewed articles

The datasets most commonly used by the papers under review are the German and Australian public credit datasets from the UCI Machine Learning Repository [7,8,9,10,11]. In addition, some studies collect and mine their own data. For example, Chee Kian Leong (2015) uses data from a firm in Singapore [12]. The authors in [13,14,15,16,17,18,19,20,21,22,23,24,25,26] all employ their own datasets. Those articles mainly emphasize the significance and the veracity of the original data.

We discuss the principles and applications of the machine learning approaches overall. The traditional machine learning models for credit risk include Support Vector Machines (SVMs) [5], k-Nearest Neighbor (k-NN) [3], Random Forests (RFs) [4], Decision Trees (DTs) [27,28,29], AdaBoost [30], Extreme Gradient Boost (XGBoost) [31], Stochastic Gradient Boosting (SGB) [32], Bagging [33], Extreme Learning Machine (ELM) [34] and Genetic Algorithm (GA) [35]. Neural network models generally belong to deep learning methods. They include Convolutional Neural Networks (CNNs) [36], Deep Belief Neural Networks (DBNs) [37], Artificial Neural Networks (ANNs) [38], Long Short-Term Memory (LSTM) [39], Restricted Boltzmann Machines (RBMs) [40], Deep Multi-Layer Perceptron (DMLP) [41], and Recurrent Neural Networks (RNNs) [42].

Summary tables and bar charts regarding all the methods of the reviewed papers are provided.

2.4 Taxonomy

The taxonomy is shown in Fig. 3. It is divided into two parts: the first concerns computing technology and the second concerns the credit risk application domain. Each part is further divided into subcategories. The two parts are connected: every application sub-domain on the right side uses the techniques on the left side, and every technique can be applied in the financial domains.

Fig. 3

The taxonomy of the selected papers

3 Computing approaches

This section briefly introduces three main computing techniques used for credit analysis, i.e., statistical learning, machine learning and deep learning, each of which has its own characteristics while sharing similar principles. Statistical approaches are traditional ways to classify a customer’s or enterprise’s credit behavior. However, with the rapid development of artificial intelligence, machine learning and deep learning have gradually taken the place of statistical analysis.

3.1 Statistical learning approaches

We divide the statistical approaches into three groups: discriminant analysis, logistic regression and Bayesian models.

LDA (Linear Discriminant Analysis) is a classic technique for predicting the group membership of samples [1]. It seeks a linear combination of features that separates the classes, here the two classes of a binary target.

Logistic regression is a classification algorithm which uses the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability [6].
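
The mapping from a linear score to a probability can be sketched in a few lines of Python. This is a minimal illustration only: the two features, the weights and the bias are hypothetical hand-picked values, not parameters fitted to any credit dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(features, weights, bias):
    """Squash the linear score w.x + b with the sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical parameters and input (assumptions for illustration only).
weights = [0.8, -1.2]
bias = 0.1
p = predict_probability([1.0, 0.5], weights, bias)  # linear score is 0.3
print(round(p, 3))  # about 0.574
```

A borrower would then be assigned to the positive class when p exceeds a chosen threshold, commonly 0.5.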

Naïve Bayes methods are statistical learning algorithms that apply Bayes’ theorem with the “naïve” assumption of conditional independence between every pair of features given the class variable [43]. A Bayesian network is a probabilistic model based on graphs. It represents the conditional dependence structure of a set of random variables that comply with Bayes’ theorem [44].

3.2 Machine learning methods

We review a series of conventional machine learning algorithms that are well suited to the credit risk area.

k-NN [3] is a classification method that assigns to an input x the majority class among its k nearest neighbors in the dataset [3].
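
The majority-vote rule can be sketched as follows. The toy two-dimensional points, the "good"/"bad" labels and the use of Euclidean distance are assumptions made for illustration; real credit data would need feature scaling and a tuned k.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, x, k=3):
    """Return the majority label among the k training points closest to x."""
    distances = sorted(
        (math.dist(point, x), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of borrowers.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["good", "good", "good", "bad", "bad", "bad"]

print(knn_predict(points, labels, (1.5, 1.5)))  # good
print(knn_predict(points, labels, (6.5, 6.5)))  # bad
```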

Tree-based methods are effective in the credit risk domain. Typical examples include DTs [27,28,29], Random Forests (RFs) [4], Classification and Regression Trees (CART) [45], C4.5 [46], and Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE) [47].

Support Vector Machine [5] constructs a hyperplane (a decision boundary) that separates classes in a high-dimensional feature space. It outputs a class identity according to whether \(w^{T}x + b\) is positive or not [6]. Here, w is the weight vector normal to the hyperplane, x is the input feature vector, and b is the bias; the margin between the negative and positive hyperplanes is inversely proportional to the norm of w.

Boosting is an ensemble method that combines individual models to gain higher capacity [6]. Adaptive Boosting (AdaBoost) is one of the most popular boosting algorithms: the weights of the training instances are re-assigned at each round, with higher weights assigned to incorrectly classified instances [48]. Stochastic Gradient Boosting (SGB) [32] incorporates randomness as an integral part of the Gradient Boosting algorithm. This family of algorithms also includes Extreme Gradient Boost (XGBoost) [31], which is similar to Gradient Boosting. Its trees are still added sequentially, but the construction of each tree is parallelized.

Bagging is an ensemble method that reuses the same kind of model, training algorithm, and objective function several times [6]. It is also known as bootstrap aggregation [33].
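
A minimal sketch of bootstrap aggregation is given below. The base learner (a one-dimensional threshold "stump"), the toy data and all function names are hypothetical choices made to keep the example short; the cited works typically bag full decision trees.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Pick the threshold on a 1-D feature with the fewest training errors.
    The stump predicts 1 ("default") when the feature exceeds the threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x > threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(samples, n_models=25, seed=0):
    """Train the same base learner on n_models bootstrap resamples."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = rng.choices(samples, k=len(samples))  # sample with replacement
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Aggregate the individual stump predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (risk score, defaulted?) pairs.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.05), bagging_predict(models, 0.95))  # 0 1
```

An odd number of models is used so that the binary vote cannot tie.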

The extreme learning machine [34] was developed by Guang-Bin Huang in 2006. It targets single-hidden-layer feedforward neural networks (SLFNs): it randomly chooses the hidden nodes and analytically determines the output weights of the SLFN [34].

The genetic algorithm (GA) [35] is a heuristic search algorithm for search and optimization problems. It first generates an initial population and then computes a fitness score for every individual in it. Individuals are then selected to produce offspring [49].
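
The loop of population, fitness, selection and reproduction can be sketched as follows. This is an illustrative toy, not the procedure of any cited paper: the bit-string encoding, the fitness function (number of 1-bits), truncation selection, one-point crossover, the mutation rate and the elitism step are all assumptions.

```python
import random

def fitness(individual):
    """Toy fitness: number of 1-bits (a stand-in for a real model score)."""
    return sum(individual)

def evolve(n_bits=12, pop_size=20, generations=30, mutation_rate=0.05, seed=1):
    rng = random.Random(seed)
    # 1. Generate the initial population of random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in population)]
    for _ in range(generations):
        # 2. Score and rank all individuals.
        ranked = sorted(population, key=fitness, reverse=True)
        # 3. Select the fitter half as parents.
        parents = ranked[: pop_size // 2]
        next_population = [ranked[0][:]]  # elitism: keep the best unchanged
        while len(next_population) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)       # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]               # bit-flip mutation
            next_population.append(child)
        population = next_population
        history.append(max(fitness(ind) for ind in population))
    return max(population, key=fitness), history

best, history = evolve()
print(fitness(best), history[0])
```

Because the best individual is always carried over, the best fitness recorded in history never decreases from one generation to the next.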

3.3 Deep learning methods

Deep learning models have more layers and more units within a layer than traditional machine learning models. They can represent functions of increasing complexity [6]. In this section, we review some crucial deep learning methods used in credit risk.

Artificial Neural Networks [38] were inspired by biological neural systems. An ANN generally has three kinds of layers: input, hidden and output layers. Given a feature vector x, the ANN outputs \(\hat{y}\) through the following formula [50]:

$$\begin{aligned} \hat{y}=a_{2}(a_{1}(\alpha ^{(1)}x+\alpha ^{(1)}_{0})\alpha ^{(2)}+\alpha _{0}^{(2)}) \end{aligned}$$

where \(\alpha ^{(1)}, \alpha ^{(2)}\) are weights, \(\alpha _{0}^{(1)}, \alpha _{0}^{(2)}\) are the corresponding bias terms, and \(a_{1},a_{2}\) are activation functions.
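
The forward pass in the formula above can be sketched in plain Python. The choice of ReLU for \(a_{1}\) and the sigmoid for \(a_{2}\), as well as the parameter values, are assumptions for illustration; the formula itself does not fix them.

```python
import math

def relu(values):
    """Hidden activation a_1 (assumed here to be ReLU)."""
    return [max(0.0, z) for z in values]

def sigmoid(z):
    """Output activation a_2 (assumed here to be the logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """y_hat = a_2(a_1(W1 x + b1) . w2 + b2) for one hidden layer."""
    hidden = relu([
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(W1, b1)
    ])
    output = sum(h * w for h, w in zip(hidden, w2)) + b2
    return sigmoid(output)

# Hypothetical parameters: 2 inputs, 2 hidden units, 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, -1.0]
b2 = 0.0
print(forward([2.0, 2.0], W1, b1, w2, b2))  # 0.5
```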

Recurrent neural networks (RNNs) are a family of neural networks for processing sequential data [6]. They are better suited to sequential information than to the spatial data that Convolutional Neural Networks (CNNs) process effectively. RNNs introduce state variables that store past information and, together with the current inputs, determine the current outputs [51].

LSTM [39] was first developed to produce paths along which the gradient can flow for long durations [6]. It is a variant of Recurrent Neural Networks (RNNs). Compared with traditional RNNs, it alleviates the vanishing and exploding gradient problems on long sequences.

DMLP is a Multi-Layer Perceptron with multiple hidden layers. It is a directed neural network. To update the weights, the loss function for DMLP uses Softmax and Cross-Entropy [50].

LeCun et al. first introduced CNNs [36], which have been widely applied in image processing, voice recognition, automatic QA systems, and many other computing fields. CNNs consist of an input layer, convolutional layers, pooling layers and fully connected layers.

The convolution function is as follows [6]:

$$\begin{aligned} s(t)=\int _{-\infty }^{\infty }x(a)w(t-a)\,da \end{aligned}$$

where x is the input signal and w(a) is a weighting function of a, the age of a measurement.
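
In practice the integral is replaced by a sum over discrete samples, \(s(t)=\sum _{a}x(a)w(t-a)\). A minimal sketch of this discrete form, with a hypothetical two-tap averaging kernel, is:

```python
def convolve(x, w):
    """Full discrete convolution: s[t] = sum over a of x[a] * w[t - a]."""
    s = [0.0] * (len(x) + len(w) - 1)
    for a, x_a in enumerate(x):
        for j, w_j in enumerate(w):
            s[a + j] += x_a * w_j  # here t = a + j, so w_j = w[t - a]
    return s

signal = [1.0, 2.0, 3.0]
kernel = [0.5, 0.5]  # hypothetical averaging kernel
print(convolve(signal, kernel))  # [0.5, 1.5, 2.5, 1.5]
```

Note that many deep learning libraries implement the closely related cross-correlation (no kernel flip) under the name convolution.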

Hinton et al. introduced DBNs [37] which are a class of deep neural networks. A typical DBN consists of several hidden layers of Restricted Boltzmann Machines. An output of a lower level RBM can be regarded as input of the higher level RBM [50].

RBMs are some of the most common building blocks of deep probabilistic models. They are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables [6].

An RBM has an energy function similar to that of the Boltzmann Machine. The function is as follows [50]:

$$\begin{aligned} E(v,h)=-\sum _{i=1}^{n}a_{i}v_{i}-\sum _{j=1}^{m}b_{j}h_{j} -\sum _{i=1}^{n}\sum _{j=1}^{m}\alpha _{ij}v_{i}h_{j} \end{aligned}$$

where \(a_{i}, b_{j}\) are biases for the binary variables \(v_{i}, h_{j}\), and \(\alpha _{ij}\) is the weight between visible unit i and hidden unit j.
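
The energy of a single configuration can be evaluated directly from this definition. The sketch below uses hypothetical biases and weights for a machine with two visible units and one hidden unit.

```python
def rbm_energy(v, h, a, b, alpha):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij alpha_ij v_i h_j."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(
        alpha[i][j] * v[i] * h[j]
        for i in range(len(v))
        for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction

# Hypothetical parameters: n = 2 visible units, m = 1 hidden unit.
a = [0.5, 0.5]
b = [-1.0]
alpha = [[2.0], [3.0]]
print(rbm_energy([1, 0], [1], a, b, alpha))  # -1.5
```

Lower energy corresponds to a more probable joint configuration of visible and hidden units.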

4 Credit risk application with computing algorithms

In the past decades, many scholars have employed various computing algorithms and models for credit risk prediction and assessment. Binary classification is the most fundamental problem formulation in credit risk scenarios. In this section, we divide the related studies into two groups from the perspective of finance: consumer and corporate.

4.1 Consumer credit risk

Consumer credit scoring is one of the main parts of credit risk management. It determines the creditworthiness of a customer based on their past credit situation. In [52], the Bayesian network method is improved to find out whether there is a change in credit risk profiles. Numerous approaches have been implemented in this domain. Typical examples include Extreme learning machines (ELM) [7], Ensemble of classifiers [8], Bayesian networks [12], Deep Genetic Cascade Ensembles [11], a hybrid model with convolutional neural networks and Relief algorithm [22], Genetic Programming [53], feature selection [54], RNN [55], ensemble of supervised learning and statistical learning [56], Radial Basis Function [57], TreeSHAP method for Stochastic Gradient Boosting [58], a real-time binary classification model [59], CNN [60], MLP [61], etc. The authors in [62] compared the traditional and machine learning models in the credit score evaluation area.

Predicting a consumer’s future credit condition is also valuable for credit risk and quantitative analysis. The authors in [19] compare deep learning techniques with other machine learning methods. The comparison shows that XGBoost outperforms traditional machine learning techniques such as Logistic Regression, SVM and Random Forest. It also turns out that a hybrid model is capable of predicting credit risk. In [63], a model named TRUST (Trainable Undersampling with SELF Training) was shown to be effective.

Credit card fraud (CCF) is a specific crime in the banking system and a rapidly growing problem worldwide [64]. Detecting it helps to control credit risk in banking security. A novel framework called DEAL (Deep Ensemble Algorithm) is employed in [64]. Recurrent Neural Networks (RNNs) [65], Boosted Decision Trees [66,67,68], and a deep learning structure with advanced feature engineering [69] display satisfactory performance. The authors in [70] compare Deep Learning, Logistic Regression and Gradient Boosted Trees. In [71,72,73], the authors implemented LR, SVM, k-NN, NB, RF, DT and MLP methods and found that they were all robust, while tree-related models had the best performance. By using an auto-encoder, the authors in [70] create features with domain expertise, which is shown to improve predictive power. In [74], Visual Analytics was used to help reduce the incidence of false positives.

4.2 Corporate credit risk

Corporate credit risk also calls for machine learning and deep learning.

Deep learning plays a significant role in corporate credit rating and assessment. The two-layer additive risk model [13], Artificial Neural Network [15], LSTM and AdaBoost [9], denoising-based neural network [21], deep belief network [14, 75], probabilistic neural network (PNN) [76], Genetic algorithm with neural network [77], and CNN [78] all prove competent in estimation and assessment.

Online supply chain financial risk can be controlled by proper estimation and assessment. The authors in [24] construct a deep belief network based on Restricted Boltzmann Machines and a softmax classifier. The dataset came from the annual financial reports of notable Chinese companies. The model achieves an accuracy far beyond that of SVM and Logistic Regression. In [79, 80], SVM and XGBoost were more accurate than LR and NB in supply chain fraud detection.

Because of the lingering effects of the Global Financial Crisis of 2008, a large number of corporations are under the threat of bankruptcy. Neural networks can help those in danger detect early signals of collapse. A series of machine learning methods have been applied to predict bankruptcy [16, 81]; bagging, boosting and random forest have the best performance. In [10], random forests are shown to outperform most of the other machine learning models.

In [26], statistical methods (probit models and Classification And Regression Trees, CART) and machine learning methods (Neural Networks and k-NN) are applied and compared to make predictions in the financial intermediary domain.

International finance, of which peer-to-peer (P2P) lending is an important branch, has flourished in the past decades. It normally carries greater credit risk than the conventional financial industry. Neural Networks [17, 82, 83], Attention Mechanism LSTM [20], word embedding models [84], Ensemble Learning Method [85, 86], and Restricted Boltzmann Machine (RBM) [87] all contribute to predicting the risk of the P2P industry.

Mortgage credit and prepayment risk are vital issues for measuring a borrower’s behavior in the real estate finance industry. In [88], the authors use deep neural networks to find a highly nonlinear relationship between a borrower’s behavior and risk factors. Deep learning is shown to be effective in measuring mortgage risks.

Big data technology has triggered a massive transformation of finance. According to Denis Ostapchenya, a financial expert, big data in banking can be deployed to assess risks in the procedure of trading stocks or checking the creditworthiness of a loan applicant. Big data analysis also accelerates and supports the processes that require compliance verification, auditing, and reporting [89]. In the credit risk domain, the combination of machine learning, big data and specific financial techniques has achieved satisfactory results. BP neural networks, genetic algorithm [90], logistic regression with XGBoost and AdaBoost [91, 92], Synthetic Minority Oversampling Technique algorithm [93], integrated and mixed models [94, 95] all play a vital role in predicting and classifying credit risk.

5 Performance ranking of machine learning techniques

5.1 Data imbalance

Data imbalance often occurs in credit risk classification because the number of good borrowers differs hugely from the number of bad borrowers. SMOTE [93] is one of the most widely used approaches to address this problem. In addition, over-sampling and under-sampling techniques are also employed. Nevertheless, data imbalance has been severely underestimated in many credit risk studies.
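
The simplest of these remedies, random under-sampling of the majority class, can be sketched as below. This is deliberately not SMOTE, which synthesizes new minority samples by interpolation; the function name and the toy data are hypothetical.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, n_min):  # draw without replacement
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 8 good borrowers (label 0), 2 bad borrowers (label 1).
borrowers = list(range(10))
classes = [0] * 8 + [1] * 2
balanced = random_undersample(borrowers, classes)
print(len(balanced))  # 4
```

The price of this balance is that most majority-class records are discarded, which is one reason the results of such techniques can remain unsatisfactory.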

5.2 Evaluation metrics

In this review, we select accuracy (ACC) and AUC as the main metrics for performance evaluation. ACC is the number of correctly classified samples divided by the total number of samples, while AUC is the area under the ROC curve, which also measures classification quality.

ACC is calculated as follows:

$$\begin{aligned} ACC= \frac{TP+TN}{TP+FP+FN+TN} \end{aligned}$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

AUC [96] can be expressed as the following formula:

$$\begin{aligned} AUC=\frac{1+\frac{TP}{TP+FN}-\frac{FP}{FP+TN}}{2} \end{aligned}$$
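
Both metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. The counts are hypothetical, the "bad borrower" class is assumed to be the positive class, and both classes are assumed to be present (otherwise a denominator is zero). The AUC expression above is the area under an ROC curve with a single operating point, which coincides with balanced accuracy.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

def single_point_auc(tp, tn, fp, fn):
    """AUC = (1 + TPR - FPR) / 2 for one classification threshold."""
    tpr = tp / (tp + fn)  # true positive rate
    fpr = fp / (fp + tn)  # false positive rate
    return (1 + tpr - fpr) / 2

# Hypothetical classifier on 200 borrowers, 40 of them bad (positive class).
print(accuracy(tp=30, tn=140, fp=20, fn=10))          # 0.85
print(single_point_auc(tp=30, tn=140, fp=20, fn=10))  # 0.8125

# Labelling everyone "good" on a 90/10 imbalanced dataset.
print(accuracy(tp=0, tn=90, fp=0, fn=10))          # 0.9
print(single_point_auc(tp=0, tn=90, fp=0, fn=10))  # 0.5
```

The second case shows why ACC alone can be misleading on imbalanced credit data: a classifier that never flags a bad borrower still reaches 90% accuracy, while its AUC is no better than chance.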

5.3 Ranking of techniques

There is no consensus on the specific ranking of each machine learning technique. In this section, we propose our own ranking, based on a thorough and objective investigation. Because the open-source German and Australian credit risk datasets have uniform judging criteria, we select the common techniques appearing in the related literature and compare their performance. For each method, we use the mean of each metric. The bar charts are shown in Fig. 4.

Fig. 4

The accuracy from German credit data

The graph shows that machine learning methods universally have a higher accuracy than statistical methods. Bagging has the highest AUC and Random Forest (RF) has the highest ACC. Logistic Regression is the most powerful tool among the statistical methods in credit risk classification. Naïve Bayes (NB), k-Nearest Neighbor (k-NN) and Classification and Regression Trees (CART) have comparatively low rankings on the German credit dataset. The detailed ranking results are shown in Table 1.

Table 1 Rank from German credit data

Similarly, we calculate and sort the mean ACC and AUC reported on the Australian credit dataset. The result is shown in Fig. 5. It turns out that the accuracy on the Australian dataset exceeds that on the German dataset because the imbalance ratio of the German dataset is comparatively higher. ANN achieves the best AUC, and ELM achieves the best ACC.

Fig. 5

The accuracy from Australian credit data

From Fig. 5, we can conclude that deep learning methods are more potent than traditional machine learning and statistical methods. The specific ranking is shown in Table 2.

Table 2 Rank from Australian credit data

In short, based on ACC and AUC values, deep learning techniques perform better on public credit risk datasets than machine learning and statistical learning methods.

6 Discussions

6.1 Existing survey papers

In this section, we review several typical surveys published recently. In [97], the majority of machine learning methods and data imbalance are discussed, but the discussion only focuses on the card fraud domain and the authors did not consider the synergetic effects of models. Xolani Dastile, Turgay Celik et al. [50] conducted a thorough and systematic investigation of machine learning and its application in credit risk. Nevertheless, the role of deep learning models in credit risk is not fully covered. In [98], the principles of machine learning methods are not clearly presented. In [99], an abundant bibliography is provided. However, the structure of the paper is not balanced. Siddharth Bhatore et al. [100] presented a complete review of machine learning in credit risk with clear graphs, but they largely ignored the limitations of the datasets. In [101], problems similar to those of [98] occur.

In our work, we give a comprehensive analysis and provide detailed comparison among methods, hoping to improve existing results.

6.2 The summary tables

We summarize our survey in the following four tables. A whole summary table is shown in S1 Table.

Table 3 shows that LR and Bayesian models are the most used among the statistical learning techniques.

Table 3 Papers containing statistical learning models

As shown in Table 4, AdaBoost, SVM, tree-related models, k-NN and Bagging are the most frequently implemented models among the machine learning techniques, while SGB (Stochastic Gradient Boosting) and ELM (Extreme Learning Machine) appear in relatively few papers.

Table 4 Papers containing machine learning models

Table 5 shows that ANN and MLP are the most widely used deep learning models. Moreover, nearly all of the listed deep learning methods have a balanced citation distribution.

Table 5 Papers containing deep learning models

We list several important works that use unique datasets in Table 6. Almost all of them deploy their own computing models that improve the original algorithms. The results show that the models are effective.

Table 6 A summary table with unique datasets

6.3 Challenges

We summarize four major challenges in the research of machine learning-driven credit risk. First, data imbalance in credit risk is quite severe. Although several approaches such as over-sampling and under-sampling (usually under-sampling the majority class) have been proposed to solve this problem, the results are still unsatisfactory in terms of both effectiveness and efficiency. Second, the shortage of benchmark datasets is serious. Most existing works use private datasets, so performance comparisons cannot be entirely fair. Third, most machine learning models are black boxes, since they are generally not transparent; model transparency deserves more attention. Fourth, the application of deep learning models in credit risk is still limited.

These four challenges should be overcome in future work. We hope that more advanced deep models will emerge in this area.

7 Conclusions

In conclusion, we have witnessed the broad application of machine learning and deep learning methods in the credit risk area. We build a taxonomy that links computing algorithms and finance. We also briefly introduce the principles of statistical and machine learning approaches. For public datasets, we rank the methods according to their accuracy. In addition, we list some of the accuracy values reported on private and unique datasets. A checklist is provided in S2 Table.

The results show that deep learning methods are more powerful than the traditional machine learning and statistical approaches, although they have not been fully employed. Also, the conclusion that ensembles of several methods outperform a single one is supported by some of the related studies [9, 11, 75, 81, 103, 104].

In the future, we intend to find proper solutions to the challenges mentioned above. First, we should find new ways to tackle the problem of imbalanced data. Second, we will look for a comprehensive judging criterion to make up for the shortcomings of specific methods and the inconsistency of datasets. Third, we should seek improvements in the transparency of machine learning methods. Fourth, we should try new and improved deep learning models on the credit risk classification problem.

Moreover, in recent years, some authors have proposed a series of representative nature-inspired metaheuristic algorithms such as monarch butterfly optimization (MBO) [105], the earthworm optimization algorithm (EOA) [106], elephant herding optimization (EHO) [107], the moth search algorithm (MS) [108], the slime mould algorithm (SMA) [109], hunger games search (HGS) [110], the colony predation algorithm (CPA) [111] and Harris hawks optimization (HHO) [112]. They can also be applied in credit risk prediction. Besides, the Runge Kutta optimizer (RUN) [113] is a metaheuristic algorithm that does not rely on the metaphors typical of other metaheuristic algorithms. Generally, these novel intelligent computational algorithms have not been sufficiently applied in finance due to the complexity and instability of risk-related problems. However, they may yield promising results when the analysis tools become more mature.

Last but not least, big data technology and its application in credit risk is a newly booming area. We will explore it and exploit the scale and efficiency of big data tools such as MapReduce and the Hadoop platform to get better results.