Dataset
This research used risk-level data from the Prudential Life Insurance Assessment competition on Kaggle. The data can be obtained from https://www.kaggle.com/c/prudential-life-insurance-assessment/data. The data contain life insurance applicant information that can be used to predict the level of risk.
In the data, there are 59,381 observations (individuals) and 128 features describing life insurance applicant attributes: 60 categorical features, 13 continuous features, 5 discrete features, 48 dummy variable features, 1 Id feature that shows the id for a region, and 1 response feature that shows the life insurance risk category. There are also 13 features that contain missing values, which are presented in Table 1.
Table 1 Percentage of missing values

Table 1 shows that the highest percentage of missing values is in Medical_History_10, with 99.07% missing values. All features that contain missing values are of float data type.
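As an illustration, missing-value percentages per feature such as those in Table 1 can be computed with pandas. The following is a minimal sketch, assuming the competition training file is named train.csv; it is not part of the original analysis:

```python
import pandas as pd

# Load the Prudential training data (file name assumed to be train.csv,
# as provided by the Kaggle competition).
data = pd.read_csv("train.csv")

# Percentage of missing values per feature, keeping only the features
# that actually contain missing values (cf. Table 1).
missing_pct = (data.isnull().mean() * 100).round(2)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```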
Another problem is imbalance in the target. The target in the data is the response feature, which shows the level of life insurance risk. The imbalance can be seen from the number of observations at each risk level, which is shown in Table 2.
Based on Table 2, the number of observations differs across the 8 risk levels, with a particularly large gap between risk level 8 and the others. This difference indicates that there is an imbalance problem in the data.
Missing value
A missing value is information that is not available for an object or case. Missing values occur when information about an object is not given, is difficult to find, or simply does not exist. If the historical data of the applicants contain a large percentage of missing values, then they need to be handled.
According to the mechanism of missingness, there are three types of missing values [8]: missing completely at random (MCAR), when the probability that a value is missing is unrelated to the values in the observed dataset; missing at random (MAR), when the probability of missingness depends on the observed data but not on the specific values that are missing; and missing not at random (MNAR), when the missingness falls under neither MCAR nor MAR and depends on the unobserved values themselves, so the missing-data mechanism must be modeled to obtain unbiased parameter estimates.
Several papers deal with the problem of missing values. Little and Rubin [9] studied missing values with statistical analysis. Zhang et al. [10] used a non-ignorable missing value model with an MCMC sampling algorithm in the Bayesian framework. Ma and Chen [11] conducted similar research with a Bayesian approach for handling missing values. Dewi et al. [12] handled missing values by replacing them with 0 (zero), the mean, the median, or the most frequent value in the same column; that research showed that the XGBoost method can work on data that contain missing values without any explicit handling. Stephen [13] studied the handling of missing data in numeric analyses. Wijasekara and Liyanage [14] compared imputation methods for missing values in air pollution data. Sanjar et al. [15] studied missing data imputation for geolocation-based price prediction using the KNN-MCF method.
Imbalanced dataset
An imbalanced dataset, or imbalance in data, is one of the problems in classification. When the dataset is imbalanced, the classification algorithm does not have sufficient information about the minority class to make accurate predictions [16]. In the multi-class classification case, imbalance can occur between one class and another. A dataset is said to be imbalanced when there are significant differences in the amount of data in each class, i.e., when classes that have relatively little data (minority classes) are much smaller or less frequent than classes that have the most data (majority classes) [17].
Machine learning and deep learning algorithms are strongly affected by the class imbalance problem, so it needs to be handled [18]. Imbalance between classes in a multi-class classification can be handled with several techniques, one of which is oversampling. Oversampling is a technique used to adjust the class distribution of a dataset. One popular oversampling method is the random over sampler, which randomly duplicates minority class samples until the class distribution is balanced [19].
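The following is a minimal sketch of random oversampling, assuming the imbalanced-learn library's RandomOverSampler and synthetically generated data as a stand-in for the risk-level target:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy multi-class data with an imbalanced class distribution
# (a stand-in for the risk-level target in the Prudential data).
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# Randomly duplicate minority-class samples until all classes are balanced.
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)

print(Counter(y))      # imbalanced class counts
print(Counter(y_res))  # each class now matches the majority count
```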
XGBoost
XGBoost is a method that develops the basic gradient tree boosting model into extreme gradient boosting. Based on the paper by [6], the model uses classification and regression trees (CART). CART is a binary decision tree that repeatedly divides a node into two leaf nodes. A CART tree is formed by selecting the best split for each node of the tree. For example, given \({\mathbf{D}}\) as follows:
$${\mathbf{D}} = \{ ({\mathbf{x}}_{i} ,y_{i} )\} ,\;\;\;1 \le i \le n$$
(1)
$${\mathbf{x}}_{i} = [x_{ij} ],\;\;\;1 \le j \le m$$
(2)
where \(n\) and \(m\) are the number of applicants and features in the data, respectively, \({\mathbf{x}}_{i}\) is the data of the \(i\)th life insurance applicant, \(y_{i}\) is the insurance claim risk level of the \(i\)th applicant (the actual target), and \(x_{ij}\) is the value of the \(j\)th feature for the \(i\)th applicant.
In the multi-class classification problem, a CART is formed by grouping the same class in each branch of the tree, starting with a node containing all the data; the splitting process then continues until the stopping criteria are reached. In the splitting rule, a node \(a\) becomes two child nodes, left (L) and right (R), with \(s_{ij} \in \{ a_{ij} |i \in I_{a} ,1 \le j \le m\}\) as the splitting point and the results of splitting
$$I_{a} = I_{\text{L}} \cup I_{\text{R}}$$
(3)
$$I_{\text{L}} = \{ i \in I_{a} |a_{ij} < s_{ij} \}$$
(4)
$$I_{\text{R}} = \{ i \in I_{a} |a_{ij} \ge s_{ij} \}$$
(5)
The best splitting point \(s_{ij}\) is determined by testing all possible candidates for \(s_{ij}\) and choosing the one that minimizes the Gini index value of the two child nodes, \(G_{\text{L}} + G_{\text{R}}\). If the data at a node consist of only one class, the Gini value of that node is minimal. The Gini index is used to determine the purity of a node and is defined by
$$G = 1 - \sum\limits_{i = 1}^{M} (p_{i})^{2}$$
(6)
where \(M\) is the number of classes and \(p_{i}\) is the probability of class \(i\), \(i = 1,2, \ldots ,M\).
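A minimal sketch of Eq. (6), assuming the class labels of the instances at a node are given as a list:

```python
import numpy as np

def gini_index(labels):
    """Gini impurity of a node: 1 - sum_i p_i^2 (Eq. 6)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node (a single class) has the minimum Gini value of 0,
# while a node mixing classes evenly has a higher value.
print(gini_index([1, 1, 1, 1]))   # 0.0
print(gini_index([1, 2, 1, 2]))   # 0.5
```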
In gradient tree boosting, the splitting process uses all values in the data as candidate splitting points. The split for each feature is determined with the exact greedy algorithm [6].
Problems arise if missing values occur in the dataset. In XGBoost, this is handled with the sparsity-aware split finding algorithm, which can accurately find splits even when values are missing. The algorithm allows the CART construction in XGBoost to deal with missing values directly; the full algorithm is given in [6].
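The following is a simplified, illustrative sketch of the idea behind sparsity-aware split finding; it is not XGBoost's actual implementation. It assumes per-instance gradient statistics g and h and an L2 regularization term lam, with the gain expression following the formulation in [6]; missing values are assigned a learned default direction by evaluating both options at every candidate split:

```python
import numpy as np

def sparsity_aware_split(x, g, h, lam=1.0):
    """Best split for one feature when x may contain np.nan (missing).

    x   : feature values of the instances at the node
    g,h : first- and second-order gradient statistics per instance
    lam : L2 regularization term
    Returns (gain, threshold, default_direction_for_missing).
    """
    def score(G, H):
        return G * G / (H + lam)

    present = ~np.isnan(x)
    G, H = g.sum(), h.sum()                          # totals, missing included
    Gm, Hm = g[~present].sum(), h[~present].sum()    # stats of missing rows

    order = np.argsort(x[present])
    xs, gs, hs = x[present][order], g[present][order], h[present][order]

    best = (0.0, None, None)
    GL = HL = 0.0
    for k in range(len(xs) - 1):
        GL += gs[k]; HL += hs[k]
        if xs[k] == xs[k + 1]:
            continue                                 # no split between ties
        thr = (xs[k] + xs[k + 1]) / 2.0
        # Try sending the missing values to the right, then to the left.
        for direction, (Gmiss, Hmiss) in (("right", (0.0, 0.0)),
                                          ("left", (Gm, Hm))):
            GLd, HLd = GL + Gmiss, HL + Hmiss
            GRd, HRd = G - GLd, H - HLd
            gain = score(GLd, HLd) + score(GRd, HRd) - score(G, H)
            if gain > best[0]:
                best = (gain, thr, direction)
    return best
```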
Machine learning
Machine learning is a method for processing data and information so that what is learned can be applied to new data. Machine learning implements various algorithms that iteratively learn from data to describe the data and predict results [20]. Several stages need to be carried out when processing data with machine learning, including vectorization, preprocessing, learning, and evaluation.
Vectorization
Vectorization is the process of representing data in the form of vectors. It is carried out after the data to be used have been collected, i.e., raw data in formats such as Excel, Microsoft Access, or text files. After the vectorization process is completed, the data in vector form are identified and processed in the later stages.
Preprocessing
In general, the collected data cannot be handled directly with machine learning because they have several problems, such as missing values, imbalance, or inconsistencies. Therefore, preprocessing is necessary before the data can be used to build a model.
In this research, data preprocessing is done by converting categorical variables to dummy variables with a one hot encoder and by standardizing continuous variables. The technique used for standardization is the standard scaler, i.e., standardization using the normal distribution or statistical normalization (Z-score) [21]. The formula can be written as follows:
$$x' = \frac{{\left( {x_{i} - \mu_{i} } \right)}}{{\sigma_{i} }}$$
(7)
where \(\mu_{i}\) is the mean and \(\sigma_{i}\) is the standard deviation of the \(i\)th feature.
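A minimal sketch of this preprocessing step with scikit-learn, assuming a small illustrative data frame whose column names are placeholders rather than the full Prudential feature set:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small illustrative frame; the column names are placeholders
# inspired by the Prudential data, not the actual feature set.
df = pd.DataFrame({
    "Product_Info": ["A", "B", "A", "C"],    # categorical feature
    "Ins_Age": [0.23, 0.45, 0.61, 0.37],     # continuous feature
})

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Product_Info"]),
    ("zscore", StandardScaler(), ["Ins_Age"]),   # (x - mean) / std, Eq. (7)
])

X = preprocess.fit_transform(df)
print(X)
```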
Learning
Learning is the process of determining the values of a model's parameters based on data. In the learning process, model fitting is performed, i.e., a machine learning model suitable for the problem to be solved is fitted to the data. There are two kinds of learning in machine learning: unsupervised learning and supervised learning.
In supervised learning, a target feature is included in the training data, so that each training example is a data pair. The purpose of supervised learning is to build a model whose output best matches the target for all training data [22]. Two kinds of problems can be solved using supervised learning, i.e., classification and regression. Classification produces output in the form of classes or categories to classify data accurately, while regression produces output in the form of continuous or real values. In unsupervised learning, there is no target feature in the training data. Unsupervised learning builds a model that can describe hidden structures in the data. The problems included in this category are clustering and dimensionality reduction.
The problem in this research is a supervised learning problem, namely multi-class classification. The dataset contains input and output pairs, where the inputs are the historical data of life insurance applicants and the target is the class of the applicant's risk.
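A minimal sketch of fitting a multi-class XGBoost classifier, assuming synthetically generated data as a stand-in for the preprocessed applicant features (not the actual experimental setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the preprocessed applicant data
# (8 classes, mirroring the 8 risk levels).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The scikit-learn wrapper infers the multi-class objective from the labels.
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out split
```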
Evaluation
In the evaluation stage, the model's accuracy is estimated using cross-validation to see how well the model fits the data.
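A minimal sketch of cross-validation with scikit-learn, again on synthetic stand-in data; the fold count and model settings here are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in data, as before.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=8, random_state=0)

# Stratified 5-fold cross-validation: the model is trained on four folds
# and scored on the remaining fold, rotating over all folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(XGBClassifier(n_estimators=100), X, y, cv=cv)
print(scores.mean(), scores.std())
```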
Risk assessment as a machine learning problem
Life insurance risk assessment is a multi-class classification problem that belongs to supervised learning. The performance of supervised learning can be measured using evaluation metrics. The evaluation metric used in this research is the confusion matrix, a performance measurement for classification models. In a confusion matrix, the predicted class is compared with the actual class: each column of the matrix shows the predicted results for the class corresponding to that column, while each row shows the actual class.
Figure 1 shows the confusion matrix for multi-class classification, where \(c_{k}\) is the positive class and any class other than \(c_{k}\) is a negative class. TN (true negative) is the number of cases where the model correctly predicts the negative class. FP (false positive) is the number of cases where the model incorrectly predicts the positive class. TP (true positive) is the number of cases where the model correctly predicts the positive class. FN (false negative) is the number of cases where the model incorrectly predicts the negative class.
The confusion matrix is used to calculate performance metrics such as accuracy, precision, and recall. Accuracy measures the proportion of predictions the model gets right. The formula for accuracy is given below:
$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}}.$$
(8)
Precision evaluates how precise a model is in predicting positive labels; it is the percentage of predicted positives that are actually relevant. The formula for precision can be written as follows:
$${\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}.$$
(9)
Recall calculates the percentage of actual positives that the model correctly identifies (true positives). The formula for recall is given below:
$${\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}.$$
(10)
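A minimal sketch of computing these metrics with scikit-learn on toy predictions; in the multi-class case, precision and recall are computed per class and then averaged (macro averaging is assumed here):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy predictions for a 3-class example (a stand-in for the 8 risk levels).
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # Eq. (8)
print(precision_score(y_true, y_pred, average="macro"))   # Eq. (9), per class then averaged
print(recall_score(y_true, y_pred, average="macro"))      # Eq. (10), per class then averaged
```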