Introduction

China's auto finance market started relatively late; the idea of buying a car in installments only appeared in 1993. In 1998, the government introduced a policy encouraging automobile consumer loans, effectively kick-starting China's automobile finance market. By 2018, China's automotive finance market had reached 139 million yuan, a growth of 19.2%. With better personal credit information, this market is set to grow. According to the China Banking Regulatory Commission, from 2013 to 2017, the compound annual growth rate of outstanding loans in China's auto financing business was as high as 29%. By the end of 2017, the loan balance of the auto finance business in China had reached 668.8 billion yuan, an increase of 28.39% year-on-year. Today, the auto finance industry accounts for an increasing proportion of the overall personal credit and finance industry, and its influence on China's economy is also increasing, together with the accompanying financial credit risks. Two factors compound this phenomenon: greater lifestyle consumption and easier access to online finance [1, 2]. Indeed, the auto finance industry has many advantages, such as a flexible credit verification process and simpler vetting procedures, compared with traditional financial institutions [3,4,5].

At present, a variety of auto finance products are available in the market, such as the highly popular P2P online auto finance platforms and micro-loan networks. Riding the Internet tide, companies are trying to attract consumers with new technologies and new models, hoping to take the lead in the field of auto finance. All these signs indicate that the auto finance industry will develop rapidly in the future. This development has brought not only many benefits but also some drawbacks. Because the auto finance industry is characterized by high risk and high returns, it cannot develop in a sustainable and healthy manner without effective control measures [6,7,8,9]. At present, China's auto finance industry is at an early stage of development and still faces many problems, such as an incomplete personal credit investigation system, inadequate laws and regulations, and weak risk supervision and management, all of which make the credit risk problem particularly important. Therefore, scientific and effective control of credit risk has become an important issue in the development of China's auto finance industry.

The main participants in China's auto finance industry are auto finance companies, financial leasing companies, internet finance companies, and banking institutions, and the whole industrial chain is relatively complete. In terms of market share in China, however, auto financing companies occupy the main part of the market, and most Chinese consumers choose the financial products of auto financing companies. As the main economic entities behind the automobile industry's subsidiary products, auto financing companies mainly handle automobile consumer finance loans. In operation, an auto financing company not only pursues its own profits but also undertakes the task of providing consumer loans to customers to promote vehicle sales.

Compared with banks, auto financing companies bear more credit risk due to their specific business purposes. Because of various institutional defects in the current business model, it is difficult to form sound and effective risk management measures. In addition, an imperfect credit system and industrial factors such as fluctuations in automobile prices lead to a large number of bad debts in the auto finance industry [10,11,12]. Against this background, financial institutions have suffered significant losses from vehicle loan defaults, auto underwriting has been tightened, and the rejection rate of auto loans has increased. Credit institutions therefore demand rigorous credit risk assessment models that accurately predict the probability of a borrower defaulting on the first EMI (Equated Monthly Installment) of a vehicle loan on the due date, so as to identify customers with high credit risk and further reduce the default rate. Doing so also ensures that clients capable of repayment are not rejected, and identifies important determinants that can be used to minimize default rates. Motivated by this, this paper studies how to establish a credit risk assessment model for auto financing companies with high classification and prediction accuracy, so as to not only guarantee their own earnings but also control the bad debt rate generated by credit. This has important practical significance for auto financing companies and for the auto finance industry as a whole.

Compared with the existing congeneric methods for the credit risk assessment of personal auto loan, this paper makes two contributions as follows.

  1. (i)

    First, to reduce the feature dimension, enhance the generalizability of the model, and reduce the possibility of overfitting, and given the 45 preliminary indexes and the limitations of current single feature selection methods, this paper proposes an improved Filter-Wrapper feature selection method that combines the Filter and Wrapper approaches. In the Filter stage, three evaluation criteria are selected: the Relief algorithm, the Maximum Information Coefficient method, and the Quasi-separable method. Then, the order relation analysis method is used to determine the corresponding weights of the three evaluation criteria, and a fusion model of multiple evaluation criteria is constructed to comprehensively rank the feature importance. In the Wrapper stage, the Random Forest (RF) is selected as the classifier and the Sequence Backward Selection (SBS) method is used to screen the final optimal feature subset, thus effectively improving the classification accuracy of the subsequent models.

  2. (ii)

    Second, most scholars study credit risk assessment in the traditional financial field, but there is little research on the credit risk assessment of personal auto loans in the auto finance industry. In today's internet era, China's auto finance industry is developing rapidly, and it is necessary to study the increasingly prominent credit risks of auto loans arising in this development. Based on this, this paper proposes a PSO-XGBoost model for the credit risk assessment of personal auto loans, which is novel in the research on the credit risk assessment of auto loans in China's auto finance industry. To evaluate the performance of the models, the PSO-XGBoost model is compared against the XGBoost, RF, and Logistic Regression (LR) models on performance evaluation indexes such as accuracy, precision, ROC curve, and AUC value. The results show that the PSO-XGBoost model is superior to the other models in classification performance and classification effect. This validates the choice of the PSO-XGBoost model for the credit risk assessment of personal auto loans.

This paper is organized as follows. Section “Literature review” surveys the literature. Section “Data preprocessing and unbalanced data set transformation” presents the data preprocessing and the transformation of the unbalanced data set. Section “Feature selection method of credit risk assessment index” proposes a Filter-Wrapper feature selection method to select the credit risk assessment indexes for the personal auto loans. Section “Credit risk assessment of personal auto loans using PSO-XGBoost model” presents a PSO-XGBoost model for the credit risk assessment of the personal auto loans and the accompanying empirical analysis. The final section concludes the paper.

Literature review

Adopting an appropriate feature selection method to remove redundant features and reduce the dimension of the data can effectively improve the computational speed and classification performance of an algorithm. Therefore, feature selection is indispensable when processing massive data. Currently, popular feature selection methods include Filter [13], Wrapper [14], the Modified-Dynamic Feature Importance based Feature Selection (M-DFIFS) algorithm [15], the Mean Fisher-based Feature Selection Algorithm (MFFSA) [16], Markov Blanket-based Universal Feature Selection [17], Improved Binary Global Harmony Search (IBGHS) [18], the MCDM-based method [19], joint semantic and structural information of labels [20], and the fast multi-objective evolutionary feature selection algorithm (FMABC-FS) [21].

The Filter method is simple and feasible, and researchers have developed evaluation criteria for it, such as the Relief algorithm, the Maximal Information Coefficient method, and the Information Gain method. The Relief algorithm, a feature-weighting method proposed by Kira [22], assigns weights to features according to their ability to distinguish the samples; each weight is then compared with a threshold value, and a feature whose weight is less than the threshold is deleted. In applying the Filter method, Ma and Gao [23] employed a filter-based feature selection approach using Genetic Programming (GP) with a correlation-based evaluation method, and their experiments on nine datasets show that features selected by the feature construction approach (FCM) improve classification performance compared to the original features. Thabtah et al. [13] proposed a simple filter method to quantify the similarity between the observed and expected probabilities and generate scores for the features. They report that their approach significantly reduces the number of selected features on 27 datasets. The Wrapper method takes the accuracy obtained by the subsequent learning algorithm as the evaluation criterion. Compared to the Filter method, the Wrapper method is computationally complex with low operational efficiency, albeit with higher accuracy. Gokalp et al. [24] proposed a wrapper feature selection algorithm using an iterative greedy metaheuristic for sentiment classification. Khammassi and Krichen [25] presented an NSGA2-LR wrapper approach for feature selection in network intrusion detection. González et al. [26] applied a new wrapper feature selection method, based on a multi-objective evolutionary algorithm, to analyze accuracy and stability for BCI. Mafarja and Mirjalili [27] proposed a wrapper feature selection approach based on the Whale Optimization algorithm.

A single feature selection method is often not comprehensive, and the Filter and Wrapper methods each have their own merits and drawbacks. As such, some studies combine both methods and propose fusion feature selection methods that combine a variety of evaluation criteria. For example, Rajab [28] analyzed the advantages and disadvantages of the Information Gain (IG) algorithm and the Chi-square (CHI) algorithm, and then used them in combination. Solorio-Fernández et al. [29] presented a hybrid filter–wrapper method for clustering, which combines the spectral feature selection framework using the Laplacian Score ranking and a modified Calinski–Harabasz index. Rao et al. [30] presented a two-stage feature selection method based on the filter and wrapper to select the main features from 35 borrower credit features. In the Filter stage, three filter methods are used to compute the importance of the unbalanced features. In the Wrapper stage, a Lasso-logistic method is used to filter the feature subset using a search algorithm.

Thus, following the earlier works, this paper combines the Filter and Wrapper methods to propose an improved Filter-Wrapper two-stage feature selection method to select the credit risk assessment indexes of the personal auto loans. However, compared to the existing fusion approaches of the Filter and Wrapper methods, our two-stage feature selection method differs in the following aspects. In the Filter stage, we select three evaluation criteria, i.e., the Relief algorithm, the Maximal Information Coefficient method, and the Quasi-separable method, to evaluate the importance of the features from the respective aspects of information relevance, amount of information, and quasi-separable ability. A fusion model of multiple evaluation criteria is then constructed to rank the importance of the features. In the Wrapper stage, the Random Forest (RF) is selected as the classifier, the classification accuracy is used as the measurement standard, and the Sequence Backward Selection (SBS) method [31] is used for the feature selection. Based on the classification accuracy, the quality of the corresponding feature subset is evaluated, and the optimal feature subset is selected as the set of evaluation indexes for the credit risk assessment of the personal auto loans.

Auto finance credit stems from consumer credit finance, notably individual credit risk assessment. Traditional analysis methods, such as 5C and LAPP, are subjective and highly dependent on expert experience. Research then switched to mathematical models: Durand [32] was the first to use discriminant analysis to assess individual credit risk. With the advent of greater computing power and the availability of massive data sets, artificial intelligence methods such as machine learning, data mining, and deep learning have emerged.

However, traditional statistics, non-parametric statistics, machine learning, and data mining have mostly been applied separately to credit risk assessment. With such single-technique methods, there are often problems of low prediction precision, model overfitting, and low algorithmic efficiency. Therefore, researchers have since combined statistical methods with artificial intelligence methods such as machine learning and data mining to address those shortcomings in individual credit risk assessment. For example, Yu and Wang [33] proposed a kernel principal component analysis based least squares fuzzy support vector machine method with variable penalty factors for credit classification, and conducted an empirical analysis to prove the effectiveness of the model. Combining decision tree theory with machine learning methods, Rao et al. [34] selected a loan data set on the Pterosaur Loan platform, and used a two-stage Syncretic Cost-sensitive Random Forest (SCSRF) model to evaluate the credit risk of the borrowers. Further, Lanzarini et al. [35] combined particle swarm optimization with competitive neural networks to propose an LVQ + PSO model to predict credit customers' loan situations. Barani et al. [36] proposed a new improved Particle Swarm Optimization (PSO) combined with Chaotic Cellular Automata (CCA). Similarly, Mojarrad and Ayubi [37] proposed a novel particle swarm optimization (PSO) approach that combines chaos and velocity clamping with the aim of eliminating its known disadvantage of forcing particles to keep searching at the boundaries of the search space. However, as credit datasets are typically high-dimensional, class-imbalanced, and of large sample size, Liu et al. [38] recently proposed an Evolutionary Multi-Objective Soft Subspace Clustering (EMOSSC) algorithm for credit risk assessment. Luo et al. [39] employed a two-stage clustering method using a kernel-free support vector machine, and applied the method incorporating t-test feature weights for credit risk assessment.

While there is rich research on personal credit risk assessment, particularly on optimizing the performance of the current credit risk assessment models by either improving or combining statistical methods with artificial intelligence to obtain better prediction, there is little literature on the credit risk assessment of personal auto loans in the auto finance industry. In this paper, we study the problem of the credit risk assessment of personal auto loans, and combine Particle Swarm Optimization (PSO) with the XGBoost model to form a PSO-XGBoost model to evaluate the credit risk of personal auto loans. We validate the PSO-XGBoost model against three evaluation models (XGBoost, RF, and LR).

Data preprocessing and unbalanced data set transformation

To study the current credit risk problem in the auto finance industry and to reduce the loan default rate of auto financing institutions, we select a data set of personal auto loans from the Kaggle platform as the research sample. The data set is first preprocessed, and specific indexes are transformed and merged to construct composite indexes. Next, based on the description and value range of the indexes in the data set, the credit risk assessment indexes are preliminarily pre-screened. Finally, the unbalanced data set is processed and transformed into a balanced data set.

Data preparation and preprocessing

The data set studied in this paper is a set of personal auto loan records from auto financing institutions, available on the open Kaggle data platform. It can be downloaded from https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction.

The selected data set contains 233,154 customer loan records, of which 182,543 records represent non-defaulting customers and 50,611 records represent defaulters. The data set contains 41 indexes, of which 40 are independent variables used to predict a borrower's loan default. The 40 indexes provide the following information on the loan and the loanee: loanee information (demographic data such as age, identity proof, etc.), loan information (disbursal details, loan-to-value ratio, etc.), and bureau data and history (bureau score, number of active accounts, status of other loans, credit history, etc.). These indexes reveal the borrower's personal information, economic health, and credit history. Another index, loan_default, marks whether the borrower has defaulted, and this is the dependent variable. It divides borrowers into two categories: “0” denotes non-defaulters and “1” denotes defaulters. The data set has no missing values except in the index “Employment.Type”. Table 1 provides a description of the notation used.

Table 1 Notation and description

Table 1 shows four types of data in the data set: integer, floating point, date, and character. Among them, index X4 is a floating-point type, X9 and X11 are date types, X21, X38, and X39 are character types, and the other indexes are integer types. The date-type and character-type data cannot be used directly, so data cleansing and data conversion are needed. Data cleansing mainly deals with data exceptions, including missing value processing, type value processing, exception point processing, and outlier processing. Data conversion enhances data processing through data discretization, data specification, or the creation of new variables.

(1) Data cleansing.

(i) Type values processing.

In the data set, indexes X9 (Date of birth of the customer) and X11 (Date of disbursement) are date-type indexes, which are processed as follows. The date of birth of the customer is converted to the current age, and the date of disbursement is converted to the number of months from the current time. For the character-type indexes X38 (Average loan tenure) and X39 (Time since first loan), their index values are converted to a number of months. For index X10 (Employment type of the customer), the Self Employed type is denoted as 0 and the Salaried type is denoted as 1; this index contains missing values. In addition, there are 20 components in index X21 (Bureau score description), which are converted using the literal meaning of the description. Table 2 contains the specific conversion results.

Table 2 Risk transformation of bureau score description

(ii) Exception point processing and outliers processing.

In looking for outliers in the data set, we note that some age values derived from index X9 (Date of birth of the customer) are less than or equal to zero, which is implausible. Hence, we replace them with null values and treat them as missing values. Also, for indexes X25 (Total principal outstanding amount of the active loans at the time of first disbursement) and X31 (Total principal outstanding amount of the active loans at the time of second disbursement), some index values are less than zero, which is invalid; these are also replaced with null values.

(iii) Missing values processing.

The objects with a null value in index X9 (Date of birth of the customer) are filled with the mean age. The missing values in index X10 (Employment type of the customer) are imputed using the RF machine learning algorithm: the employment type of the borrower is taken as the dependent variable, the other indexes are treated as independent variables, a random forest is trained on the records with known employment types, and the trained model is then used to classify and predict the unknown employment types.
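As an illustration of the cleansing steps above, the following is a minimal pandas sketch, assuming the Kaggle column names (Date.of.Birth, DisbursalDate, AVERAGE.ACCT.AGE, CREDIT.HISTORY.LENGTH, Employment.Type); the reference date, label strings, and the feature subset used for the RF imputation are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("train.csv")                      # Kaggle vehicle-loan data (assumed file name)

# (i) Type values: dates -> age in years and months since disbursement.
ref = pd.Timestamp("2019-01-01")                   # reference date (assumption)
dob = pd.to_datetime(df["Date.of.Birth"], dayfirst=True, errors="coerce")
df["age"] = (ref - dob).dt.days // 365
disb = pd.to_datetime(df["DisbursalDate"], dayfirst=True, errors="coerce")
df["months_since_disbursal"] = (ref - disb).dt.days // 30

def to_months(s):
    """Convert tenure strings such as '1yrs 11mon' to a number of months."""
    yrs, mon = s.split()
    return int(yrs.rstrip("yrs")) * 12 + int(mon.rstrip("mon"))

df["avg_loan_tenure_m"] = df["AVERAGE.ACCT.AGE"].map(to_months)
df["time_since_first_loan_m"] = df["CREDIT.HISTORY.LENGTH"].map(to_months)
# Label strings as they appear in the data (assumption).
df["employment_type"] = df["Employment.Type"].map({"Self employed": 0, "Salaried": 1})

# (ii) Outliers: implausible ages become missing values.
df.loc[df["age"] <= 0, "age"] = np.nan

# (iii) Missing values: mean-age fill, then RF imputation of the employment type.
df["age"] = df["age"].fillna(df["age"].mean())
predictors = ["age", "months_since_disbursal", "disbursed_amount", "ltv"]  # illustrative subset
known = df[df["employment_type"].notna()]
unknown = df[df["employment_type"].isna()]
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(known[predictors], known["employment_type"].astype(int))
df.loc[unknown.index, "employment_type"] = rf.predict(unknown[predictors])
```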

(2) Data transformation.

As the data set contains many indexes with the same meaning occurring at different times (the first and second disbursements), notably indexes X22 and X28, and indexes X23 and X29, we merge the indexes with the same or similar meaning to yield composite indexes, as shown in Table 3.

Table 3 Composite indexes

The approach for merging the indexes in Table 3 is as follows. The indexes loan_to_asset_ratio, Total_no_of_accts, Pri_inactive_accts, Sec_inactive_accts, Total_inactives_accts, Total_actives_accts, Total_current_balance, Total_sanctioned_amount, Total_disbursed_amount, Total_instal_amt, Pri_loan_proportions, Sec_loan_proportions, and Active_to_inactive_act_ratio are denoted by X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, and X53, respectively, and their index values are computed as follows:

$$ \begin{gathered} X_{41} = \frac{{X_{2} }}{{X_{3} }},\;X_{42} = X_{22} + X_{28} , \hfill \\ X_{43} = X_{22} - X_{23} ,\;X_{44} = X_{28} - X_{29} , \hfill \\ X_{45} = X_{22} - X_{23} + X_{28} - X_{29} ,\;X_{46} = X_{23} + X_{29} , \hfill \\ X_{47} = X_{25} + X_{31} ,\;X_{48} = X_{26} + X_{32} , \hfill \\ X_{49} = X_{27} + X_{33} ,\;X_{50} = X_{34} + X_{35} , \hfill \\ X_{51} = \frac{{X_{27} }}{{\left( {X_{34} + 1} \right)}},\;X_{52} = \frac{{X_{33} }}{{\left( {X_{35} + 1} \right)}}, \hfill \\ X_{53} = \frac{{\left( {X_{22} + X_{28} } \right)}}{{\left( {X_{22} - X_{23} + X_{28} - X_{29} + 1} \right)}}. \hfill \\ \end{gathered} $$

Creating the new composite indexes yields 54 indexes in total. Of these, 53 indexes are independent variables related to the borrower's information and one index is the dependent variable. From the data, 12 indexes have no zero values, namely X1, X2, X3, X4, X5, X6, X7, X8, X9, X11, X12, and X13, but there are many zeros among the index values of the other 42 indexes. Hence, if more than three-quarters of the index values of a record are zero, the record is deemed invalid and deleted. As a result, 117,156 loan records remain for research and analysis.
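A short sketch of this merging and filtering step, assuming the original indexes have already been renamed X1–X40 as in Table 1 and that loan_default is the label column; the three-quarters threshold follows the rule above.

```python
# Composite indexes X41-X53, following the formulas above.
df["X41"] = df["X2"] / df["X3"]
df["X42"] = df["X22"] + df["X28"]
df["X43"] = df["X22"] - df["X23"]
df["X44"] = df["X28"] - df["X29"]
df["X45"] = df["X43"] + df["X44"]
df["X46"] = df["X23"] + df["X29"]
df["X47"] = df["X25"] + df["X31"]
df["X48"] = df["X26"] + df["X32"]
df["X49"] = df["X27"] + df["X33"]
df["X50"] = df["X34"] + df["X35"]
df["X51"] = df["X27"] / (df["X34"] + 1)
df["X52"] = df["X33"] / (df["X35"] + 1)
df["X53"] = df["X42"] / (df["X45"] + 1)

# Drop records in which more than three-quarters of the index values are zero.
zero_share = (df.drop(columns=["loan_default"]) == 0).mean(axis=1)
df = df[zero_share <= 0.75]
```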

Pre-screening credit risk assessment indexes

The credit risk assessment indexes of personal auto loans are generally divided into three categories, i.e., personal indexes, economic indexes, and credit indexes. Personal indexes reveal the basic information of the borrower, such as age, gender, job, and education, which can be used to predict changes in a borrower's loan repayment behavior. Economic indexes reflect the economic standing of the borrower; the better the economic standing, the lower the likelihood of default. Credit indexes reflect a borrower's credit history, including the credit data generated in their life, work, and so on. This information can be used to understand the borrower's repayment history and repayment willingness, and to predict future changes in repayment behavior.

From the description and value range of the indexes in the data set, it is easy to infer whether an index is a credit risk factor. Index X8 (Current pincode of the customer) is merely an identifier, equivalent to a person's name; it is not a factor affecting credit risk and is deleted. Similarly, the indexes X5 (Branch where the loan was disbursed), X6 (Vehicle dealer where the loan was disbursed), X7 (Vehicle manufacturer (Hero, Honda, TVS)), X12 (State of disbursement), and X13 (Employee of the organization who logged the disbursement) are assigned by the system, have no real impact on the credit risk assessment, and are also deleted. In addition, as index X14 takes the value 1 in all the loan records, it has no predictive role in credit risk assessment and is deleted. Screening in this way eliminates 8 indexes. Thus, 45 indexes related to customer information and 1 dependent variable index remain in the data set, as shown in Table 4.

Table 4 Resulting credit risk assessment indexes

Transforming unbalanced data set

After the data preprocessing, we convert the unbalanced data set into a balanced data set. Traditional machine learning algorithms focus on the overall accuracy, and the trained classifiers tend to favor the majority category during training [40,41,42], so the prediction accuracy of the minority category is very low. We propose a Smote-Tomek Link algorithm to convert the imbalanced data set into a balanced one, to improve the prediction accuracy of the minority category and the overall classification effect on the data set.

Smote-Tomek Link algorithm

In this section, based on the traditional Smote algorithm [42,43,44], a Smote-Tomek Link algorithm is proposed to transform the unbalanced data set into the balanced one.

The basic steps of the Smote-Tomek Link algorithm are as follows: (i) Select n minority-class sample points randomly, as in the Smote algorithm, and find the m minority-class sample points closest to each of these n points. (ii) For each selected point, randomly choose one of its m nearest minority-class neighbors and generate a new synthetic sample between the two points. On this basis, an integrated Smote-Tomek Link algorithm is designed by combining this procedure with Tomek links. The basic idea is as follows.

A newly generated data point and the closest of the original (non-synthetic) sample points form a Tomek link pair. A rule is then defined: a neighborhood is framed with the newly generated point as the center and the Tomek link distance as its radius.

If the number of minority-class or majority-class samples in this neighborhood is less than the minimum threshold, the newly generated point is considered a "trash point" and is either removed or sent back for another round of SMOTE training. If the number of minority-class or majority-class samples in the neighborhood is greater than or equal to the minimum threshold, the retained minority-class samples can be sampled and put into SMOTE training. According to this rule, the "trash points" are eliminated, and the new data that meet the criteria are retained. The above steps are repeated, and the generated samples are finally added to the data set to obtain a new balanced data sample set.
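The variant described above adds a radius-and-threshold rule around each synthetic point, so the sketch below is only an approximation: it uses the standard SMOTE plus Tomek-link cleaning combination available in the imbalanced-learn library, which reproduces the overall oversample-then-clean idea but not the paper's exact rule.

```python
from collections import Counter
from imblearn.combine import SMOTETomek

# X: feature matrix of the preprocessed loan records, y: loan_default labels.
smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X, y)   # oversample minority class, then remove Tomek links
print("before:", Counter(y), "after:", Counter(y_res))
```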

Unbalanced data set transformation based on Smote-Tomek Link algorithm

From section “Data preparation and preprocessing”, 117,156 loan records are obtained, of which 93,315 records are the auto loan data of non-defaulters and 23,841 are the auto loan data of defaulters. The imbalance ratio of the data set is almost four to one, which would affect the model's effectiveness. Thus, we use the Smote-Tomek Link algorithm proposed in Subsection “Smote-Tomek Link algorithm” to process and transform the imbalanced data set into a balanced data set. To highlight the superiority of this algorithm in processing the data set, several machine learning models are adopted to make predictions, and their effects are compared using the relevant evaluation indexes.

(1) Experimental methods

We use the Smote and Smote-Tomek Link algorithms to process data set T, yielding two data sets T1 and T2, respectively. The data sets T, T1, and T2 are each divided into 70–30 training-test sets. Then, we apply two machine learning methods, i.e., the Logistic Regression (LR) model and the Random Forest (RF) model, as the classifiers for training and prediction. The effects of the models are compared using relevant evaluation indexes such as F1-score, G-means, MCC, and AUC [45,46,47].
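A hedged sketch of this experiment, assuming (X, y), (X_s, y_s), and (X_res, y_res) hold the original set T, the Smote-processed set T1, and the Smote-Tomek-processed set T2 from the previous step; G-means is computed directly from the confusion matrix using the formula given below, and the classifier settings are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score, confusion_matrix

def evaluate(model, X_d, y_d):
    """70-30 split, train, and report F1, G-means, MCC, and AUC on the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.3,
                                              random_state=0, stratify=y_d)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_prob = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    g_means = np.sqrt(tp / (tp + fn) * tn / (tn + fp))
    return {"F1": f1_score(y_te, y_pred), "G-means": g_means,
            "MCC": matthews_corrcoef(y_te, y_pred), "AUC": roc_auc_score(y_te, y_prob)}

# Compare the raw set T with the resampled sets T1 (Smote) and T2 (Smote-Tomek Link).
for name, (X_d, y_d) in {"T": (X, y), "T1": (X_s, y_s), "T2": (X_res, y_res)}.items():
    for clf in (LogisticRegression(max_iter=1000),
                RandomForestClassifier(n_estimators=100, random_state=0)):
        print(name, clf.__class__.__name__, evaluate(clf, X_d, y_d))
```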

(2) Evaluation indexes of unbalanced learning

For a two-category problem in machine learning, the majority category is usually labeled the negative category, while the minority category with high recognition importance is labeled the positive category. Based on the true category of the sample and the category predicted by the classifier, there are four classification types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Among them, TP and TN denote positive and negative samples, respectively, that are correctly predicted by the classifier; FP and FN denote negative and positive samples, respectively, that are wrongly predicted as positive and negative by the classifier. Table 5 shows the confusion matrix of the classification results.

Table 5 Confusion matrix of classification results

From the confusion matrix, the recall rate R and precision rate P are found using [45, 47, 48]:

$$ R = \frac{TP}{{TP + FN}},\;P = \frac{TP}{{TP + FP}}. $$

(i) F1-measure

The F1-measure is the harmonic mean of the recall rate R and precision rate P, which can evaluate the overall classification of unbalanced data sets [45,46,47]. The larger the value of F1, the better the classification effect of the classifier:

$$F1{ - }measure = \frac{2}{{\frac{1}{P} + \frac{1}{R}}} = \frac{2PR}{{P + R}}.$$

(ii) G-means

The G-means evaluates the performance of the unbalanced data classification. For an unbalanced data set, the value of the G-means will be high only if the classification accuracy of positive category samples and negative category samples is relatively high. Otherwise, the value of G-means will be low. The G-means is expressed as follows [45,46,47]:

$$ G{ - }means = \sqrt {\frac{TP}{{TP + FN}} \times \frac{TN}{{TN + FP}}} $$

(iii) MCC

The Matthews Correlation Coefficient (MCC) is an important index to evaluate the performance of unbalanced data classification. In general, the greater the MCC, the better the classification effect of the model. The MCC is expressed as [45,46,47]:

$$ MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {\left( {TP \,{+}\, FP} \right) \,{\times}\, \left( {TP \,{+}\, FN} \right) \,{\times}\, \left( {TN \,{+}\, FP} \right) \,{\times}\, \left( {TN \,{+}\, FN} \right)} }} $$

(iv) AUC

The AUC is the area under the ROC (Receiver Operating Characteristic) curve, and is a common index to measure the overall classification performance of the classifier [45,46,47]. The F1, G-means, and MCC assessment indexes are based on thresholds, but AUC is not related to the selection of a threshold.

(3) Analysis of experimental results

For the original untreated data set, the data set processed by the Smote algorithm, and the data set processed by the Smote-Tomek Link algorithm, two machine learning methods (the LR model and the RF model) are used as classifiers to train and predict. The final classification results are shown in Table 6. Panels A and B show the classification results obtained by the LR and RF models, respectively.

Table 6 Classification results based on LR and RF models

From Table 6, when the imbalanced data set is not processed, the fitting effect of both the LR and RF models is extremely poor. This is because the distribution of the majority category and the minority category in the data set is uneven; as a result, the model tends to predict the minority category as the majority category during training, thus lowering the prediction accuracy of the minority category. Using the Smote algorithm or the Smote-Tomek Link algorithm to process the data greatly improves the performance of the classifier. Comparing the F1, G-means, MCC, and AUC values obtained when the same classifier is trained and tested on the data set processed by the Smote algorithm and on the data set processed by the Smote-Tomek Link algorithm, the classification effect and prediction performance with the Smote-Tomek Link algorithm are better than with the Smote algorithm. Thus, we use the Smote-Tomek Link algorithm to transform the imbalanced data set into a balanced data set.

Feature selection method of credit risk assessment index

From the balanced data set, 186,630 auto loan records are obtained, in which 45 features (indexes) reflect the borrower's auto loan information. Because the feature dimension of the auto loan borrowers is large, there may be features that are irrelevant or redundant to credit risk. Therefore, it is necessary to perform feature selection on these 45 features to further screen the indexes and simplify the feature subset, so as to reduce the dimension of the feature space. In this way, the generalizability of the established credit risk assessment model of personal auto loans can be enhanced and overfitting can be reduced.

Improved Filter-Wrapper feature selection method

An improved Filter-Wrapper feature selection method is presented for selecting the main features from among the 45 preliminary indexes in Table 4. In the Filter stage, three evaluation criteria, namely, the Relief algorithm [48, 49], the Maximal Information Coefficient [50], and the Quasi-separable method [51], are used to evaluate the importance of the features from three aspects: information relevance, information quantity, and quasi-separable ability. To overcome the subjectivity in determining the weight coefficients of feature importance, the order relation analysis method [51,52,53] is used to determine the corresponding weights of the three evaluation criteria, and a fusion model of multiple evaluation criteria is constructed to rank the importance of the features. In the Wrapper stage, the classification accuracy is used as the measurement standard, and the SBS method [31] is used to eliminate the ranked features one by one; the lower a feature sits in the ranking, the lesser its importance and the earlier it is deleted. At the same time, the feature subset remaining after each deletion is trained and predicted to obtain its classification accuracy on the data set. Each feature subset is then evaluated on its classification accuracy, and the optimal feature subset is found.

(1) Filter stage

It is difficult for an evaluation criterion to comprehensively evaluate the quality of the feature subsets. If the evaluation criteria are combined, they can complement each other and improve the evaluation quality. For the 45 preliminary features listed in Table 4, three evaluation criteria: Relief algorithm, Maximum Information Coefficient method, and Quasi-separable method, are selected.

The dimensionality of the three evaluation criteria is different, which may lead to significant differences in the corresponding values of the features and affect the subsequent fusion process of the evaluation criteria, resulting in large deviations in the results. With this in mind, the dimensions of the three evaluation criteria are harmonized using:

$$ Re_{i} = \frac{{re_{i} - \mathop {\min }\nolimits_{i} re_{i} }}{{\mathop {\max }\nolimits_{i} re_{i} - \mathop {\min }\nolimits_{i} re_{i} }},\quad i = 1,2,\ldots,45 $$
$$ M_{i} = \frac{{m_{i} - \mathop {\min }\nolimits_{i} m_{i} }}{{\mathop {\max }\nolimits_{i} m_{i} - \mathop {\min }\nolimits_{i} m_{i} }},\quad i = 1,2,\ldots,45 $$
$$ C_{i} = \frac{{c_{i} - \mathop {\min }\nolimits_{i} c_{i} }}{{\mathop {\max }\nolimits_{i} c_{i} - \mathop {\min }\nolimits_{i} c_{i} }},\quad i = 1,2,\ldots,45 $$

where rei, mi and ci are the values obtained by the Relief algorithm, Maximum Information Coefficient method, and Quasi-separable method, respectively. The max and min represent the maximum and minimum values respectively. Rei, Mi, and Ci are the values after range standardization.

Though the three evaluation criteria measure different things, they all conform to the same rule: the greater the weight of feature i, the stronger the classification ability of that feature. Thus, the values obtained by the three evaluation criteria are fused to form a fusion model of multiple evaluation criteria. The fusion evaluation value of feature i, denoted totali, represents the importance degree of feature i and is written as

$$ total_{i} = w_{1} Re_{i} + w_{2} C_{i} + w_{3} M_{i} $$
(1)

where \(w_{1}\), \(w_{2}\) and \(w_{3}\) are the weights corresponding to the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method, respectively.

As the influence of each evaluation criterion on the result is different, their weights are different, and the determination of the weights will be related to the fitting effect of the subsequent model, so determining the weights is key. For this, we employ the order relation analysis method [51,52,53] to obtain the weights of the evaluation criteria, as shown in Fig. 1.

Fig. 1
figure 1

Steps for order relation analysis method

The steps to determine the weights are as follows.

Step 1: Determine the order relationship among the evaluation criteria. From the effect of the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method, the rank relation among the evaluation criteria is as follows:

$$ U_{1} > U_{2} > U_{3} $$

where \(U_{1}\) is the Relief algorithm, \(U_{2}\) is the Quasi-separable method, and \(U_{3}\) is the Maximum Information Coefficient method.

Step 2: Obtain the relative importance of the three evaluation criteria using comparative judgment. Suppose the ratio of the importance of evaluation criteria \(U_{k - 1}\) to \(U_{k}\) is \(\gamma_{k}\) [51, 52], that is,

$$ \gamma_{k} = \frac{{U_{k - 1} }}{{U_{k} }},\;k = 2,3,...,n $$
(2)

where the value of \(\gamma_{k}\) is as defined in Table 7.

Table 7 Value of γk and description

Using Table 7 and Eq. (2), the importance of the order relation among the three evaluation criteria can be assessed. The Relief algorithm is slightly more important than the Quasi-separable method, which is slightly more important than the Maximum Information Coefficient method. Thus, we have:

$$ \gamma_{2} = \frac{{U_{1} }}{{U_{2} }} = 1.2,\qquad \gamma_{3} = \frac{{U_{2} }}{{U_{3} }} = 1.2 $$
(3)

Step 3: Compute the importance weight \(w_{m}\). The ranking of the weights of the three evaluation criteria is consistent with their corresponding positions in the order relation among them. The importance weights are found [51, 52] as follows:

$$ w_{m} = \left( {1 + \sum\limits_{k = 2}^{m} {\prod\limits_{i = k}^{m} {\gamma_{k} } } } \right)^{ - 1} $$
(4)
$$ w_{k - 1} = \gamma_{k} w_{k} ,\;\left( {k = m,m - 1,...,2} \right) $$
(5)

Combining Eqs. (4) and (5) yields

$$ w_{3} = \left( {1 + \gamma_{2} \times \gamma_{3} + \gamma_{3} } \right)^{ - 1} = \left( {1 + 1.2 \times 1.2 + 1.2} \right)^{ - 1} = 0.2747 $$
(6)
$$ w_{2} = \gamma_{3} w_{3} = 1.2 \times 0.27473 = 0.3297 $$
(7)
$$ w_{1} = \gamma_{2} w_{2} = 1.2 \times 0.32968 = 0.3956 $$
(8)

Thus, the importance weights of the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method are 0.3956, 0.3297 and 0.2747, respectively, satisfying \(w_{1} + w_{2} + w_{3} = 1\).

Step 4: Compute the fusion evaluation value \(total_{i}\). Substituting \(w_{1}\), \(w_{2}\) and \(w_{3}\) into Eq. (1), the fusion model of multiple evaluation criteria is expressed as

$$ total_{i} = 0.3956Re_{i} + 0.3297C_{i} + 0.2747M_{i} $$
(9)

Step 5: Rank the features. Using the fusion evaluation value \(total_{i}\), the features are now ranked.
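As an illustration of Steps 1–5, a minimal sketch follows; the raw criterion scores rel, mic, and qs for the 45 features and the list feature_names are assumed to be precomputed, and are not part of the original description.

```python
import numpy as np
import pandas as pd

def rescale(v):
    """Range-standardize a criterion score vector to [0, 1]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# rel, mic, qs: raw scores of the 45 features under the Relief, Maximum Information
# Coefficient, and Quasi-separable criteria (assumed precomputed).
Re, M, C = rescale(rel), rescale(mic), rescale(qs)

# Order relation analysis with gamma_2 = gamma_3 = 1.2, Eqs. (4)-(5).
g2, g3 = 1.2, 1.2
w3 = 1.0 / (1.0 + g2 * g3 + g3)   # 0.2747
w2 = g3 * w3                      # 0.3297
w1 = g2 * w2                      # 0.3956

# Fusion evaluation value, Eq. (9), and the comprehensive ranking.
total = w1 * Re + w2 * C + w3 * M
ranking = pd.Series(total, index=feature_names).sort_values(ascending=False)
print(ranking.head(10))
```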

Figure 2 shows the flowchart of the comprehensive ranking of the features during the Filter stage.

Fig. 2
figure 2

Flowchart of comprehensive ranking of the features

(2) Wrapper stage

The feature selection now enters the Wrapper stage, where we further screen the sorted features, simplify the feature subset, and reduce the dimension, so as to improve the classification accuracy. Here, the RF [54] is selected as the classifier in the Wrapper stage, and the SBS method is used to eliminate features in accordance with their rank order: starting from the complete feature set, the least important feature is removed at each iteration. At the same time, the classifier is used to train and predict on the current feature subset, so as to obtain the classification accuracy under this feature subset and compare it with the classification accuracy obtained in the previous iteration. The feature subset with the highest classification accuracy (that is, the optimal feature subset) is selected as the result of the evaluation index selection for the credit risk of the personal auto loans.

The steps of the Wrapper algorithm are as follows.

Input: The original feature set F = {f1, f2, …, fk}, where k is the number of original features; k = 45.

Output: The optimal feature subset, i.e., the subset with the highest classification accuracy.

Figure 3 shows the algorithm flowchart.

Fig. 3
figure 3

Flowchart of Wrapper algorithm
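A minimal sketch of this Wrapper loop, assuming ranking is the fused importance ranking from the Filter stage (most to least important), X is a DataFrame holding the 45 features, and y is the default label; the RF settings are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = list(ranking.index)            # ordered from most to least important
best_acc, best_subset = 0.0, list(features)

while len(features) >= 1:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    acc = cross_val_score(clf, X[features], y, cv=10, scoring="accuracy").mean()
    if acc > best_acc:
        best_acc, best_subset = acc, list(features)
    features = features[:-1]              # SBS: drop the currently least important feature

print("optimal subset size:", len(best_subset), "accuracy:", best_acc)
```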

Analysis of selection of credit risk assessment indexes

Comprehensive ranking of features in Filter stage

Using the method described in Sect. “Improved Filter-Wrapper feature selection method”, the three evaluation criteria are used to evaluate the importance of the features in the Filter stage, and the evaluation value of each feature is obtained. Then, the evaluation values are standardized to ensure dimensional consistency. Next, the fusion model of multiple evaluation criteria is used for information fusion to obtain the fusion evaluation value of each feature, and the features are then ranked according to their fusion evaluation values. The results obtained using Python are shown in Table 8.

Table 8 Comprehensive ranking of 45 preliminary features

Feature selection in Wrapper stage

With the 45 preliminary features in Table 8 ranked, we now use the Wrapper algorithm to select the optimal feature subset. To ensure a reliable classification accuracy, we use ten-fold cross-validation [55] and take the average classification accuracy over the ten runs. Figure 4 shows how the classification accuracy changes as the data dimension decreases. From Fig. 4, when the data dimension is 34, the classification accuracy of the RF classifier reaches its highest level; thereafter, the classification accuracy decreases. Thus, the first 34 features in the comprehensive ranking are chosen as the optimal feature subset, that is, they form the credit risk assessment indexes of the personal auto loans.

Fig. 4
figure 4

Classification accuracy rate vs. data dimension

Credit risk assessment of personal auto loans using PSO-XGBoost model

Next, a credit risk assessment model of the personal auto loans based on the PSO-XGBoost model is formed. The XGBoost model [56] has good characteristics such as high prediction accuracy and fast runtime, while the Particle Swarm Optimization (PSO) algorithm [57,58,59] is used to optimize the parameters of the XGBoost model. Then, the PSO-XGBoost model, XGBoost model, RF model, and LR model [60] are trained on the training data set. Predictions are made on the test data set to obtain their respective prediction outcomes, and the performance of the four models is evaluated and compared on performance evaluation indexes such as accuracy, precision, ROC curve, and AUC value.

XGBoost model

XGBoost (eXtreme Gradient Boosting) [56] is a C++ implementation based on the Gradient Boosting Machine algorithm, which is a boosting algorithm. The XGBoost model constantly adds trees and splits features to make the trees grow. The data set is divided into a training set and a test set in a 7:3 ratio, so 130,641 records of auto loan data are selected to form the XGBoost model. The training set D, containing 130,641 samples with 34 features, is expressed as \(D = \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}\left( {\left| D \right| = 130,641,x_{i} \in R^{34} ,y_{i} \in R} \right)\), where \(x_{i}\) represents the i-th sample, and \(y_{i} = \left\{ {\left. {0,1} \right\}} \right.\) represents the category of the default group, with 0 (1) being the non-default (default) group, respectively.

Now, suppose the total number of trees is K, then the predicted value of K to the sample is [56]

$$ \hat{y}_{i} = \phi \left( {x_{i} } \right) = \sum\limits_{k = 1}^{K} {f_{k} \left( {x_{i} } \right),f_{k} \in F} $$
(10)
$$ F = \left\{ {f\left( x \right) = w_{q\left( x \right)} } \right\}\left( {q:R^{34} \to T,w \in R^{T} } \right) $$
(11)

where \(\hat{y}_{i}\) is the predicted value of the model, representing the predicted category label of sample \(i\); \(F\) is the set of classification and regression trees (CART); \(f\left( x \right)\) is a regression tree; \(w_{q\left( x \right)}\) represents the score of the leaf node to which a sample is mapped, namely, the prediction for that sample; q maps a sample to a leaf node, that is, given a sample, the model maps it to the predicted category output by the corresponding leaf node and judges whether it belongs to the non-defaulting or defaulting population; \(w\) is the vector of leaf node scores, and T is the number of leaf nodes in the tree.

From Eq. (10), we note that the predicted values of the XGBoost model are the sum of the predicted values of the K trees. To learn these K trees, we define the objective function, which contains a loss function and a regularization function [56], and this can be expressed as

$$ Obj = L\left( v \right) + \Omega \left( v \right) = \sum\limits_{i} {l\left( {\hat{y}_{i} ,y_{i} } \right)} + \sum\limits_{k} {\Omega \left( {f_{k} } \right)} $$
(12)

where \(L\left( v \right)\) is the loss function, which can evaluate the fitting degree of the model; \(\Omega \left( v \right)\) is the regularization function used to simplify the model and control its complexity; \(\hat{y}_{i}\) is the predicted value of the model representing the predicted category label of sample \(i\); \(y_{i}\) is the true category label of sample \(i\); \(l\left( {\hat{y}_{i} ,y_{i} } \right)\) is used to measure the deviation degree between the actual and the predicted values obtained by the credit risk assessment model, which is a non-negative real valued function; \(k\) is the number of trees, and \(f_{k}\) is the kth tree.

The term \(\Omega \left( {f_{k} } \right)\) in Eq. (12) is the regularization term [56], which is given by

$$ \Omega \left( {f_{k} } \right) = \gamma T + \frac{1}{2}\lambda \left\| w \right\|^{2} $$
(13)

where T is the number of leaf nodes in each tree, \(w\) is the set of leaf node scores in each tree, \(\gamma\) is the penalty coefficient on the number of leaves, and \(\lambda\) is the penalty coefficient on the leaf scores. \(\gamma\) and \(\lambda\) jointly determine the model's complexity.

According to the XGBoost model, each newly generated tree fits the residual of the previous round. Therefore, when t trees have been generated, Eq. (10) can be written as

$$ \hat{y}_{i}^{\left( t \right)} = \hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right) $$
(14)

Substituting Eq. (14) into Eq. (12), the objective function can be rewritten as [56]

$$ Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right)} \right) + \Omega \left( {f_{t} } \right)} $$
(15)

The goal is to find the \(f_{t}\) that minimizes Eq. (15). In the Gradient Boosted Decision Tree (GBDT), only the first-order gradient is used. Compared to the GBDT, the XGBoost model rewrites the objective function using a second-order Taylor expansion. Thus, Eq. (15) is approximated by

$$ Obj^{\left( t \right)} \approx \sum\limits_{i = 1}^{n} {\left[ {l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right) + g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} $$
(16)

where \(g_{i} = \partial_{{\hat{y}_{i}^{{\left( {t - 1} \right)}} }} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\), \(h_{i} = \partial_{{\hat{y}_{i}^{{\left( {t - 1} \right)}} }}^{2} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) are the first- and second-order derivatives of the loss function with respect to \(\hat{y}_{i}^{{\left( {t - 1} \right)}}\). As \(l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) is a constant, Eq. (16) can be rewritten as

$$ Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {\left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} $$
(17)

Clearly, \(Obj^{\left( t \right)}\) depends on the first- and second-order derivatives of the error function at each data point. Thus, the iteration over trees is turned into an iteration over leaf nodes, and the following results can be obtained [56].

$$ \begin{gathered} Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {\left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} \hfill \\ = \sum\limits_{i = 1}^{n} {\left[ {g_{i} w_{{q\left( {x_{i} } \right)}} + \frac{1}{2}h_{i} w_{{q\left( {x_{i} } \right)}}^{2} } \right]} + \gamma T + \lambda \cdot \frac{1}{2}\sum\limits_{j = 1}^{T} {w_{j}^{2} } \hfill \\ = \sum\limits_{j = 1}^{T} {\left[ {\left( {\sum\limits_{{i \in I_{J} }} {g_{i} } } \right)w_{j} + \frac{1}{2}\left( {\sum\limits_{{i \in I_{j} }} {h_{i} } } \right)w_{j}^{2} } \right] + \gamma T + \lambda \cdot \frac{1}{2}\sum\limits_{j = 1}^{T} {w_{j}^{2} } } \hfill \\ = \sum\limits_{j = 1}^{T} {\left[ {\left( {\sum\limits_{{i \in I_{J} }} {g_{i} } } \right)w_{j} + \frac{1}{2}\left( {\sum\limits_{{i \in I_{j} }} {h_{i} + \lambda } } \right)w_{j}^{2} } \right] + \gamma T} . \hfill \\ \end{gathered} $$
(18)

Therefore, the problem is transformed into finding the extreme value of a quadratic function in \(w_{j}\). That is, we must find the optimal value of \(w_{j}\) that minimizes Eq. (18). Writing \(G_{j} = \sum\nolimits_{{i \in I_{j} }} {g_{i} }\) and \(H_{j} = \sum\nolimits_{{i \in I_{j} }} {h_{i} }\), and using the method of solving the extremum of a quadratic function, we obtain \({w}_{j}^{*}\) and the minimum value of the objective function [56] as follows.

$$ w_{j}^{*} = - \frac{{G_{j} }}{{H_{j} + \lambda }},\;Obj = - \frac{1}{2}\sum\limits_{j = 1}^{T} {\frac{{G_{j}^{2} }}{{H_{j} + \lambda }}} + \gamma T $$
(19)
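For clarity, the standard intermediate step (not spelled out above) is to set the derivative of Eq. (18) with respect to each \(w_{j}\) to zero:

$$ \frac{{\partial Obj^{\left( t \right)} }}{{\partial w_{j} }} = G_{j} + \left( {H_{j} + \lambda } \right)w_{j} = 0\;\; \Rightarrow \;\;w_{j}^{*} = - \frac{{G_{j} }}{{H_{j} + \lambda }} $$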

Compared to the GBDT model, the XGBoost model adds the regularization term to the objective function of the credit risk assessment model to prevent overfitting. At the same time, the Taylor expansion is used to optimize the objective function and find the best split point in the CART regression tree. Therefore, the constructed credit risk assessment model has higher accuracy and better fitting performance than the other models.
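Before the PSO tuning described next, a plain XGBoost baseline on the 34 selected features might look like the sketch below; the hyperparameter values shown here are placeholders, not the paper's tuned settings.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X34: the 34 selected features, y: loan_default. 7:3 split as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X34, y, test_size=0.3, random_state=0)
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=6,
                      min_child_weight=1, gamma=0.0, reg_lambda=1.0,
                      objective="binary:logistic", eval_metric="logloss")
model.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```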

PSO-XGBoost model

The parameters of the XGBoost model are often adjusted manually, resulting in a long search time and high computational cost. When the PSO algorithm is used to optimize the parameters of the XGBoost model, each parameter combination is encoded as a particle in the search space. The PSO algorithm then searches for the optimal parameters of the XGBoost model within a fixed number of iterations, so as to find the optimal solution for the XGBoost model. Thus, we integrate the PSO algorithm into the parameter optimization of the XGBoost model to form the PSO-XGBoost model. This method has fast convergence, higher precision, and lower cost. The steps of the PSO-XGBoost model are shown in Fig. 5.

Fig. 5
figure 5

Algorithm flow of PSO-XGBoost model

We must first determine the parameters of the XGBoost model to be optimized. As the accuracy of the XGBoost model is important, three parameters are selected for optimization, i.e., the learning rate, the maximum depth of the tree, and the minimum leaf node sample weight. Thus, the dimension of the particle swarm space in the PSO algorithm is 3. Next, the maximum number of iterations, the learning factors, the inertia weight, and the number of particles N in the PSO must be determined. Finally, the PSO-XGBoost model is constructed, and the predicted error rate is taken as the fitness of the PSO algorithm, that is, the error rate function is taken as the fitness function.

The next steps are to initialize the entire particle swarm in the three-dimensional space (including each particle's position and velocity) and to compute the error rate of each particle after initialization according to the error rate function. The local and global optimal values of the entire particle swarm are obtained by comparison. We then determine whether the termination condition is met (i.e., the maximum number of iterations is reached). If it is not met, the velocity and position of each particle are updated, the error rate of each updated particle is computed using the error rate function, and each particle's error rate is compared with the current local and global optimal values. If the error rate is less than the optimal value, the optimal value is replaced with the current error rate; otherwise, the optimal value holds. If the current iteration number has not reached the maximum set, the iteration continues until the termination condition is satisfied, and the final optimal value and the corresponding parameter values are output. From the optimal values found, the best parameter values of the model are known, and the PSO-XGBoost model can then be used to assess the credit risk of the personal auto loans.
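A hedged sketch of this procedure follows, using a hand-rolled PSO loop over (learning_rate, max_depth, min_child_weight) with the cross-validated error rate as the fitness; the search bounds, swarm size, inertia weight, and learning factors are illustrative assumptions, not the paper's settings, and X_train/y_train denote the training set from the 7:3 split.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Search bounds for (learning_rate, max_depth, min_child_weight) -- illustrative.
LB = np.array([0.01, 3.0, 1.0])
UB = np.array([0.30, 10.0, 10.0])

def error_rate(p, X, y):
    """Fitness: cross-validated error rate of XGBoost at particle position p."""
    clf = XGBClassifier(learning_rate=float(p[0]), max_depth=int(round(p[1])),
                        min_child_weight=float(p[2]), n_estimators=200,
                        objective="binary:logistic", eval_metric="logloss")
    return 1.0 - cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()

def pso_xgboost(X, y, n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(LB, UB, size=(n_particles, 3))   # initialize positions
    vel = np.zeros_like(pos)                           # and velocities
    pbest = pos.copy()
    pbest_fit = np.array([error_rate(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    gbest_fit = pbest_fit.min()
    for _ in range(n_iter):
        r1, r2 = rng.random((n_particles, 3)), rng.random((n_particles, 3))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, LB, UB)
        fit = np.array([error_rate(p, X, y) for p in pos])
        improved = fit < pbest_fit                     # update local optima
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        if fit.min() < gbest_fit:                      # update global optimum
            gbest, gbest_fit = pos[fit.argmin()].copy(), fit.min()
    return gbest, gbest_fit

best_params, best_err = pso_xgboost(X_train, y_train)
print("best (learning_rate, max_depth, min_child_weight):", best_params,
      "error rate:", best_err)
```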

Analysis of credit risk assessment of personal auto loans

(1) Data set partitioning

From Sect. “Data preprocessing and unbalanced data set transformation”, 186,630 auto loan records are available for the empirical analysis, and they are divided into a training set and a test set in a 7:3 ratio, as shown in Table 9.

Table 9 Information on data set

(2) Parameter optimization

To improve the execution and classification performance of the XGBoost model, a parameter adjustment of the XGBoost model is required. For this, the PSO algorithm is used to optimize the parameters of the XGBoost model.

When forming the XGBoost model, three parameters need to be adjusted, i.e., the learning rate (learning_rate), the maximum depth of the tree (max_depth), and the minimum leaf node sample weight (min_child_weight), to improve the accuracy of the XGBoost model. The three parameters are described as follows:

  1. (i)

    Learning rate: the step size used in the updating process to prevent overfitting. After each boosting step, the weights of the new features are obtained and shrunk; reducing the feature weights makes the computation more conservative. The step size and the maximum number of iterations usually jointly determine the fitting effect of the algorithm, and the robustness of the model can also be improved by shrinking the weights at each step.

  2. (ii)

    Maximum depth of the tree: if no specific value is entered for the maximum depth of the decision trees in the XGBoost model, a default value is assumed and the depth of the subtrees is not explicitly limited when they are created. However, if the sample has a large amount of data and many features, the depth needs to be limited so as to avoid overfitting.

  3. (iii)

    Minimum leaf node sample weight: this is similar to the minimum-samples-per-leaf parameter in the gradient boosting tree algorithm. That parameter represents the minimum number of samples in a leaf, while min_child_weight in the XGBoost model represents the minimum sum of sample weights in a leaf; both are used to avoid overfitting. When the value of min_child_weight is large, the model avoids learning overly local patterns, so this parameter can be adjusted to prevent model overfitting.

The three parameters, learning rate (learning_rate), maximum depth of the tree (max_depth), and minimum leaf node sample weight (min_child_weight), in the XGBoost model are adjusted by the PSO algorithm on the 130,641 data points in the training set, to ensure model optimization and improve the accuracy of the model prediction. In the iterative process of optimizing the three parameters, the error rate of the XGBoost model is used as the fitness evaluation function in the PSO algorithm. Figure 6 shows the relationship between the number of iterations and the error rate of the model.

Fig. 6

Parameter optimization iteration of PSO-XGBoost model

Figure 6 shows that the PSO algorithm continues to optimize the parameters as the number of iterations increases, with a decreasing error rate of the model. When a stationary value is reached, the optimal value of the parameter is found and the PSO-XGBoost model has a minimum error rate.

Performance evaluation of PSO-XGBoost model

To evaluate model performance, the PSO-XGBoost model is compared with the XGBoost, RF, and LR models [60]. As the problem studied is a binary (two-category) classification problem, evaluation indexes such as accuracy, precision, complexity, the ROC curve and the AUC value are used to evaluate the models.

(1) Confusion matrix

Expanding on Table 5, we provide a confusion matrix to visualize the model’s outcome (see Table 10).

Table 10 Confusion matrix

From the confusion matrix in Table 10, there are four possible outcomes for the model's predictions. The first is the true positive (TP): the borrower has actually defaulted and the model predicts the borrower to belong to the high-risk group, which is very likely to breach the contract; the agency should therefore be highly alert to such borrowers. The second is the false negative (FN): the customer has actually defaulted, but the model wrongly predicts the customer to belong to the low-risk group; approving such customers causes large financial losses to auto financing firms. The third is the false positive (FP): the borrower has not defaulted, but the model predicts the customer to be a high-risk borrower likely to default; such borrowers are filtered out by the institution and potential revenue is lost. A similar argument applies to the fourth outcome, the true negative (TN).

In this paper, the data set is divided into a training and a test set, and the PSO-XGBoost, XGBoost, RF, and LR models are used for training and prediction, as shown by the confusion matrix of Table 11.

Table 11 Model comparison by confusion matrix

(2) Evaluation indexes of model performance

(i) Accuracy and error

Accuracy is the proportion of the number of correctly predicted samples in the total number of samples [24, 34, 61], expressed as:

$$ Accuracy = \frac{TP + TN}{{TP + FN + FP + TN}} $$
(20)

Error is the proportion of the number of incorrectly predicted samples in the total number of samples [24, 34, 61], expressed as:

$$ Error = \frac{FN + FP}{{TP + FN + FP + TN}} $$
(21)

The higher the accuracy and the smaller the error, the better the classifier model performs, and vice versa.

(ii) Precision and Recall

Precision refers to the proportion of true positive samples among all the samples judged positive by the model [24, 34, 61], that is,

$$ \, P = \frac{TP}{{TP + FP}} $$
(22)

Recall refers to the proportion of the actual positive samples that are correctly judged as positive by the model [24, 34, 61], that is,

$$ R = \frac{TP}{{TP + FN}} $$
(23)

Using Eqs. (20)-(23), the evaluation indexes of each model are obtained as shown in Table 12.

Table 12 Comparison of evaluation indexes of models

It can be seen from Table 12 that the classification accuracy of the XGBoost model is 78.88%, while that of the PSO-XGBoost model is 83.11%, an increase of 4.23%, which greatly improves the classification accuracy. At the same time, the Precision and Recall of the PSO-XGBoost model are better than those of the XGBoost model, indicating that the PSO-XGBoost model evaluates credit risk better than the XGBoost model. The logistic regression model has the worst performance among the four models, as all of its evaluation indexes are the lowest. The classification accuracy of the RF model is 80.67%, so the PSO-XGBoost model is 2.44% higher; in terms of Precision and Recall, the PSO-XGBoost model is also better than the RF model. In conclusion, among the four models, the PSO-XGBoost model performs best for the credit risk evaluation of personal auto loans.

(3) Complexity

The complexity of all the compared algorithms (PSO-XGBoost, XGBoost, RF, and LR) is measured along two dimensions, time and space. The time dimension is the running time taken to execute the algorithm, called the time complexity; the space dimension is the amount of memory required to execute it, called the space complexity. The measured time and space complexity of each algorithm are shown in Table 12. From these results, the memory required by the proposed PSO-XGBoost is only 77 M; although this is slightly higher than that of the other methods, it is still a very small memory footprint. Similarly, the running time of the proposed PSO-XGBoost is only 9 s, which means that the time complexity of the proposed algorithm is not high.

(4) ROC curve and AUC value

To construct the ROC curve, the samples are ranked by the learner's predicted scores, and the classification threshold is then moved through this ranking so that, at each step, the samples ranked above the threshold are treated as positive. At each step the true positive rate (TPR, the sensitivity) and the false positive rate (FPR, equal to one minus the specificity) are computed [24, 34, 61], using

$$ TPR = \frac{TP}{{TP + FN}},\;FPR = \frac{FP}{{FP + TN}} $$

When plotting the ROC curve, the FPR is taken as the horizontal axis and the TPR as the vertical axis. The AUC is the area under the ROC curve. When the ROC curve of one learner completely encloses that of another, the enclosing learner can safely be judged to perform better. However, when the two curves intersect, a more reasonable judgment is to compare their respective AUC values: the higher the AUC value, the better the performance of the learner.

Figure 7 compares the ROC curves and AUC values of the PSO-XGBoost, XGBoost, RF and LR models. According to the comparison rule for ROC curves [45,46,47], the closer the ROC curve is to the upper left corner, the better the evaluation performance. In Fig. 7, the ROC curve of the PSO-XGBoost model is the closest to the upper left corner and covers the ROC curves of the other three models. Furthermore, the AUC value of the PSO-XGBoost model is 0.90, higher than those of the other three models. Hence, the PSO-XGBoost model has the best performance and the highest prediction accuracy for the credit risk evaluation of personal auto loans, confirming the results of the earlier evaluation indexes.

Fig. 7

ROC curves of compared models

Further analysis of model performance

In this subsection, to judge the performance of the proposed model more fully, an additional experiment is carried out as a comparative analysis to support the claim of generalization. The data set used comes from a Chinese vehicle loan agency and is publicly available on the Kaggle platform; it can be downloaded from: https://www.kaggle.com/xiaochou/auto-loan-default-risk.

Data processing and feature selection

The selected data set contains 199,717 customer loan records, of which 164,289 are records of customers who have not defaulted and 35,428 are records of customers who have defaulted. The whole data set contains 54 indexes: 53 are information indexes used to predict customer loan default, known as independent variables, which mainly reflect the customer's basic personal information, economic status and credit record; the remaining index, Loan_default, is the dependent variable, which marks whether a customer has defaulted. The decision-making task is to establish a risk identification model to predict borrowers who may default.

First, the categorical values, abnormal values and missing values in the data set are processed and transformed. Then, the credit risk assessment indexes of auto loans in this data set are preliminarily screened, retaining 42 independent variable indexes and 1 dependent variable index. Because the numbers of default and non-default records differ greatly, with an imbalance ratio of nearly five to one, the data set needs to be rebalanced into a balanced auto loan data set. The Smote-Tomek Link algorithm proposed in Sect. “Smote-Tomek Link algorithm” is used for this imbalance processing, so as to improve the prediction accuracy for the minority class and the overall classification performance on the imbalanced data. Finally, the improved Filter-Wrapper feature selection method proposed in Sect. “Improved Filter-Wrapper feature selection method” is applied, and 30 features are selected as the optimal feature subset, as shown in Table 13.

Table 13 The optimal feature subset

The decision-making process of risk assessment

(1) Data set partitioning

From Sect. “Data processing and feature selection”, 199,717 auto loan records can be used for the empirical analysis, and they are divided into a training set and test set in a 7:3 ratio, as shown in Table 14.

Table 14 Information on data set

(2) Parameter optimization

The PSO algorithm is used to optimize the three parameters of the XGBoost model, i.e., learning_rate, max_depth and min_child_weight. In the iterative optimization of these three parameters, the error rate of the XGBoost model is used as the fitness evaluation function of the PSO algorithm, and the relation between the number of iterations and the error rate of the model is obtained, as shown in Fig. 8.

Fig. 8

Parameter optimization iteration of PSO-XGBoost model

Figure 8 shows the variation trend of the error rate of the PSO-XGBoost model. As the number of iterations increases, the PSO algorithm continues to optimize the parameters and the error rate of the model decreases. When a stationary value is reached, the optimal parameter values have been found and the PSO-XGBoost model attains its minimum error rate.

(3) Performance evaluation

To verify the performance of the PSO-XGBoost model on this data set, the proposed PSO-XGBoost model is compared with the XGBoost, RF and LR models using the same performance evaluation indexes as before, i.e., accuracy, precision, complexity, the ROC curve and the AUC value. The evaluation indexes of each model are shown in Table 15.

Table 15 Comparison of evaluation indexes of models

According to the evaluation indexes in Table 15, the Accuracy, Precision and Recall of the PSO-XGBoost model are all better than those of the other three models. Thus, the PSO-XGBoost model is superior in classification performance and classification effect. In addition, the time and space complexity of the proposed PSO-XGBoost model are not high, which shows that the proposed model is effective and practical to operate.

In addition, the ROC curves and AUC values of four compared models (PSO-XGBoost, XGBoost, RF, and LR) are plotted in the same figure, as shown in Fig. 9.

Fig. 9

ROC curves of compared models

As can be seen from Fig. 9, the ROC curve of the PSO-XGBoost model is the closest to the upper left corner, followed by those of the RF, XGBoost and LR models. The AUC value of the PSO-XGBoost model is 0.86, the highest of the four models. From both the ROC curve and the AUC value, it can be concluded that the PSO-XGBoost model presented in this paper has the best performance and the highest prediction accuracy for the credit risk evaluation of personal auto loans, which is consistent with the results in Sect. “Performance evaluation of PSO-XGBoost model”. Notably, regarding the ROC and AUC, Carrington et al. [63] pointed out that in classification and diagnostic tests, the ROC curve and AUC describe how an adjustable threshold trades off two types of error, false positives and false negatives, but they are only partially meaningful when used with imbalanced data. In this sense, if the ROC and AUC are to be used, it is best to first convert the imbalanced data set into a balanced one; otherwise, alternatives to the ROC curve and AUC should be adopted. The concordant partial AUC and the partial c statistic for ROC data proposed by Carrington et al. [63] are good choices.

Conclusion

Seeking to address the problem of credit risk assessment for personal auto loans, this paper studies the feature selection method for credit risk assessment and constructs a machine learning based credit risk assessment mechanism. Two data sets of personal auto loans from the Kaggle platform are selected as the research samples. Since the data sets are imbalanced, the Smote-Tomek Link algorithm is proposed to obtain balanced data sets. An improved Filter-Wrapper feature selection method is then proposed to select the credit risk assessment indexes of personal auto loans. Finally, a PSO-XGBoost model for credit risk assessment is constructed and an empirical analysis is carried out.

Moreover, the proposed PSO-XGBoost model is compared with the RF, XGBoost and LR models using performance evaluation indexes such as accuracy, precision, complexity, the ROC curve and the AUC value. In the empirical analysis of the first data set, given in Sect. “Data preparation and preprocessing”, the comparison results show that the classification accuracy of the PSO-XGBoost model is 83.11%, which is 4.23%, 2.44% and 12.45% higher than that of the XGBoost, RF and LR models, respectively. In terms of Precision and Recall, the PSO-XGBoost model is also better than the XGBoost, RF and LR models, and its AUC value of 0.9 is higher than those of the other three models. The results of the second empirical analysis, based on the data set given in Sect. “Further analysis of model performance”, likewise show that the PSO-XGBoost model is superior to the other models in classification performance and classification effect. This validates the choice of the PSO-XGBoost model for the credit risk assessment of personal auto loans.

Because the data sets selected in this paper are two-category (binary) data, the problem discussed here is a binary credit risk assessment of personal auto loans. In practice, however, loan customers can be classified into multiple credit levels, so that auto financing institutions can apply differentiated credit strategies to different customers and thereby improve their core competitiveness. Therefore, seeking multi-category personal auto loan data sets and extending the two-category model established in this paper to multi-category credit risk assessment are directions for future research.