An application of a supervised machine learning model for predicting life expectancy

Lipesa, Brian Aholi; Okango, Elphas; Omolo, Bernard Oguna; Omondi, Evans Otieno

doi:10.1007/s42452-023-05404-w


Research
Open access
Published: 16 June 2023

Volume 5, article number 189, (2023)

SN Applied Sciences


Brian Aholi Lipesa¹,
Elphas Okango¹,
Bernard Oguna Omolo¹,² &
…
Evans Otieno Omondi¹


Abstract

Life expectancy (LE) models have a significant impact on the social and financial systems of many nations. Numerous studies have pointed out the crucial effects that life expectancy projections have on societal issues and on the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts, and censoring. As a result, a more robust and accurate approach is needed. In this study, a supervised machine learning model for estimating life expectancy is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by applying the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The model’s predictive performance is compared with that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors used in earlier research. XGBoost attains an MAE of 1.554 and an RMSE of 2.402, outperforming the RF model (MAE 7.938, RMSE 11.304) and the ANN model (MAE 3.86, RMSE 5.002). Overall, the results of this study support XGBoost as a reliable and efficient model for estimating life expectancy.

Article highlights

This work develops a model that takes into consideration health, socioeconomic, and behavioral characteristics in predicting life expectancy to address the limitations of traditional life expectancy models, such as the ordinary life table.
The model’s predictive performance is compared with that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors.
The reliability and efficiency of XGBoost in estimating life expectancy are assessed and found to be superior to those of the Random Forest model, with implications for societal issues and global healthcare administration.


1 Introduction

The Organisation for Economic Co-operation and Development [1] defines life expectancy (LE) at birth as how long, on average, a newborn can expect to live if current death rates remain constant. More generally, LE is an estimate of the average number of additional years that a person of a given age can expect to live; at birth, it corresponds to the average age at death in a population. Estimates suggest that in ancient times, LE was around 30 years in all regions of the world [2]. Prior to the COVID-19 pandemic, population health was improving globally, increasing the global average LE at birth from 66.8 years in 2000 to 73.3 years in 2019. In 2019, LE reached 70.9 years for males and 75.9 years for females. The African Region still had the lowest LE among WHO regions in 2019, at only 64.5 years, despite experiencing the largest gains over the past two decades. The Region of the Americas had the highest LE (74.1 years) in 2000 but dropped to third place (77.2 years) in 2019 as the European and Western Pacific Regions made accelerated gains, reaching 78.2 and 77.7 years, respectively [3].

At the UN Sustainable Development Summit in September 2015, world leaders adopted the Sustainable Development Goals (SDGs) as the global agenda for development through 2030. These goals are a set of 17 commitments made by the 193 world leaders for fair and sustainable health at every level. Their aim is to end extreme poverty and to tackle inequality and climate change by the year 2030 [4]. Ensuring healthy lives and promoting well-being for all at all ages is vital to sustainable development [5]. Life expectancy is a key measure often used to assess the overall well-being and health of a population. It is an indicator of a country’s overall health, reflecting its social and economic conditions, as well as the quality of its public health and healthcare infrastructure, among other factors [6]. Wang et al. [7] note that the fundamental goal of a health system is to prolong healthy life into advanced age.

A slight growth in LE translates into sizeable increases in the population. Lengthening a population’s LE tends to increase the number of individuals living with chronic illnesses, a common feature among the elderly. Thus, the ability to project how populations will age has massive implications for the planning, support and service provision of a nation. The computation of life expectancy involves building an ordinary life table [8]. The life table is a population model covering the simple case of a birth cohort, born in the same time period, closed to migration, and followed through successive ages until all its members have died [9]. The life table assumes the homogeneity of cohorts (i.e., subjects have the same distribution of survival times) [10]. In life tables, the LE of persons alive at age x is computed as:

$$\begin{aligned} \displaystyle \text {e}_x=\frac{T_x}{l_x}, \end{aligned}$$

where $T_x$ is the total number of person-years lived by the cohort from age x until all members of the cohort have died and $l_x$ is the number of persons alive at age x [9]. Because a life table requires following a cohort through successive ages until all of its members have died, it takes a long time to construct, and the resulting estimates may not apply to current populations, whose LE is shaped by present-day technological and other factors. The homogeneity assumption of life tables is also not very realistic in practice.
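As a minimal illustration of the formula $e_x = T_x/l_x$, the Python sketch below computes life expectancy from a toy survivorship column. The cohort counts are hypothetical, and the person-years lived in each interval are approximated by assuming that deaths occur, on average, halfway through the interval; neither is taken from this study.

```python
def person_years(l_x):
    """Approximate person-years lived in each age interval.

    Assumes (for illustration only) that deaths occur on average halfway
    through each interval and that the cohort is extinct after the last one.
    """
    survivors_next = l_x[1:] + [0]
    return [(start + end) / 2 for start, end in zip(l_x, survivors_next)]


def life_expectancy(l_x, x):
    """Return e_x = T_x / l_x, where T_x sums person-years from interval x on."""
    T_x = sum(person_years(l_x)[x:])
    return T_x / l_x[x]


# Hypothetical cohort: number of persons alive at the start of each interval.
l_x = [1000, 900, 700, 400, 100]
print(life_expectancy(l_x, 0))  # life expectancy at "birth", in interval units
print(life_expectancy(l_x, 2))  # remaining life expectancy at the third interval
```

The result is expressed in units of the interval width, so with five-year intervals it would be multiplied by five.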

Furthermore, in practice, individuals are prone to censoring, either by death or by out-migration, which biases the estimates obtained from life tables. As a result, a more robust and accurate approach is needed. Raftery et al. [11] developed a Bayesian Hierarchical Model (BHM) for probabilistic life expectancy forecasts for all nations of the world and compared its performance with the UN methodology. The 10-year out-of-sample predictions of the BHM attained a mean absolute error (MAE) of 1.07 and a standard absolute prediction error (SAPE) of 1.04, outperforming the UN approach, which had reported an MAE of 1.86. Meshram [12] adopted the random forest (RF) regressor to compare life expectancy between advanced and developing economies, achieving MSE and MAE values of 4.43 and 1.58, respectively. The back-propagation artificial neural network (ANN) technique was employed in [13] to forecast life expectancy in Maluku province, Indonesia, with a reported mean absolute percentage error (MAPE) of 0.0035 and an average accuracy of 99.65%. The three approaches discussed above have the following drawbacks:

(a) The Bayesian approach requires a prior distribution even when very limited prior information is available.
(b) The random forest regressor may be unreliable and slow for real-time predictions in settings with a high number of trees [14].
(c) The back-propagation ANN model was evaluated on only 5 training and 2 testing samples, which is a very small amount of data for training neural networks.

This study proposes the use of the eXtreme Gradient Boosting (XGBoost) algorithm to predict life expectancy, considering health, socioeconomic and behavioral factors. Specifically, the study sought to: (a) identify and select the predictors considered important in contributing to life expectancy; and (b) assess the predictive performance of the developed model based on life expectancy data. XGBoost has been widely used to produce state-of-the-art results on many machine learning problems [15]. It is competitive in its ability to handle missing values, build machine learning models quickly, and apply built-in regularization to achieve accuracy gains [16]. These features make XGBoost a faster, more accurate and more desirable algorithm than other comparable ensemble algorithms for the dataset at hand.

The remaining sections of this paper are structured as follows: Sect. 2 presents the methods and materials employed, including a description of the dataset used in this study. Section 3 details the findings from this study and compares the proposed life expectancy prediction model with the RF and ANN models used in earlier research. Section 4 discusses the prediction model and experimental outcomes in relation to existing research methodologies. Section 5 wraps up the paper with some suggestions for future research.

2 Methods and materials

2.1 Data source and study variables

The data for this study were sourced from the databases of the World Health Organization (WHO) and the United Nations (UN). The data covered 193 UN member states from the year 2000–2015 and comprised LE health-related factors drawn from the Global Health Observatory data repository, while the UN data repository provided the corresponding socioeconomic-related factors for the 193 countries. After merging the individual data files, the resultant dataset contained 2937 observations of 22 variables. However, during the data pre-processing phase, it was discovered that the data values for the “population” variable were incorrect, and they were replaced with new values from the World Bank Development Indicators’ dataset. Furthermore, the “percentage expenditure on health as per GDP” variable was found to have erroneous values and was excluded from the study. Due to missing data for a considerable number of years, the data for Niue, San Marino, Cook Islands, Marshall Islands, Monaco, Palau, Tuvalu, and Dominica were excluded from this study. After data cleaning, the final dataset used in this study consisted of 2832 observations of 21 variables.

These data cleaning measures were taken to ensure the accuracy and reliability of the dataset used in our study. Tables 1 and 2 present the description of the data in terms of its summary statistics and frequencies, respectively. The detailed definitions of the study variables are presented in Table 3.

Table 1 Summary statistics for continuous variables


Table 2 Summary frequencies for categorical variables


2.2 Definition of variables

2.3 Data splitting and model validation metrics

Data splitting is a routinely employed technique for model validation in which a given dataset is partitioned into two disjoint sets: training and testing. The proposed model in this study was applied to a real-life dataset. The dataset was randomly split into a training set and a testing set following the commonly applied ratio of 80% and 20%, respectively [17, 18]. The training set was used for model development and the hold-out (testing) set for model validation. Automatic random selection, performed with the caret package [19] in R, was used to avoid introducing selection bias. R version 4.2.0 [20] was the primary software for all the statistical computations and analyses.
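The random 80/20 partition described above can be sketched as follows. This is a plain shuffle-and-cut illustration in Python using only the standard library; it is not a reproduction of the caret routine used in the study, and the function name, seed and row identifiers are assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Randomly partition rows into disjoint training and testing sets."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)                # random order removes any selection pattern
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical row identifiers standing in for country-year observations.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the hold-out set is never seen during model development.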

Table 3 Definition of the study variables


As Bakas and Kontoleon [21] recommend, after the model was developed, trained and found to be functional, its performance was evaluated before proposing its use for industry or further scientific research. The study employed the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) to examine the accuracy of the resultant prediction model, because both are widely used accuracy measures for continuous variables and are quick and easy to calculate. RMSE is naturally influenced by outliers: squaring the errors gives extreme values (which usually have larger errors than other samples) more weight in the final error, thereby impacting the model parameters [22]. MAE is known to be more robust to outliers [23]. To address this limitation of RMSE, the study used the MAE as a complementary measure for model validation.

RMSE essentially finds the square root of the average squared error between the actual value and the predicted value by the model. It is a standard way to measure the error of a model in numerical predictions. It is defined as:

$$\begin{aligned} \displaystyle \text {RMSE}=\displaystyle \sqrt{\frac{1}{n}\sum _{i=1}^n \left( y_i-{\hat{y}}_i\right) ^2} \end{aligned} \tag{1}$$

where $y_i$ is the ith observed value of the response for the data, ${\hat{y}}_i$ is the corresponding ith predicted value using the fitted model and the predictors from the data, and n is the number of observations. Lower RMSE values indicate a good fit [23].

Mean Absolute Error (MAE) (or the mean absolute deviation) finds the average absolute distance between the target value and the predicted value. It is defined as:

$$\begin{aligned} \displaystyle \text {MAE}=\frac{1}{n}\left[ \sum _{i=1}^n \left| y_i-{\hat{y}}_i\right| \right] \end{aligned}$$

where $y_i$, ${\hat{y}}_i$ and n are as defined for (1), and $\left| y_i-{\hat{y}}_i\right| $ is the absolute error. A lower MAE value indicates a better fit.
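The two metrics defined above translate directly into code. The sketch below is a minimal Python illustration with hypothetical life expectancy values (not data from this study); it also shows how a single large error inflates RMSE more than MAE.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


# Hypothetical observed and predicted life expectancies (years).
observed = [70.0, 65.0, 80.0, 75.0]
close_fit = [71.0, 64.0, 79.0, 76.0]      # every prediction is off by 1 year
with_outlier = [71.0, 64.0, 79.0, 85.0]   # last prediction is off by 10 years

print(rmse(observed, close_fit), mae(observed, close_fit))
print(rmse(observed, with_outlier), mae(observed, with_outlier))
```

When all errors have the same magnitude the two metrics coincide; once one error is much larger than the rest, RMSE exceeds MAE, which is why the two are reported together.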

2.4 Study design and population

This study followed a cross-sectional design in which information on 21 study variables for 193 UN member states from 2000 to 2015 was obtained. The study population comprised the UN member states that had data for all the years from 2000 through 2015. Countries with no data at all for any particular year were excluded from the study. According to WorldAtlas, there were 195 countries in the world as of 2022. The study population was based on the readily available data for the 193 countries, which represent nearly the entire globe. Data partitioning followed the 80:20 train-test split ratio.

2.5 Statistical analysis

In order to convert the raw data into a meaningful and efficient format, data pre-processing was undertaken in the preliminary phase of the analysis. The data were inspected for missingness and for the structure of the variables, and cleaned appropriately. Summary statistics were employed to establish the measures of central tendency. Exploratory data analysis (EDA) was applied to understand the underlying structure of the dataset and uncover insights relevant to this study. Spearman’s rank correlation coefficient was used to measure the associations between study variables as explained in [24], and the resultant correlation matrix was visualized as a heatmap using the corrplot package in R. A multivariate normality check was carried out using the MVN package [25] to ascertain whether the data were drawn from a normally distributed population.

The multivariate normality assumption was violated. Thus, the Extended Multivariate Kruskal–Wallis (E-MKW) test with missing data, proposed in [26], was used to assess whether the Regions and Income Groups had different mean vectors across the study variables. Data for the year 2015 for all the countries were reserved for making predictions. The remaining data were split into two disjoint partitions, the train and test sets, using the caret package [19]. The baseline model was initialized using the default XGBoost hyperparameters. Subsequently, the hyperparameters were fine-tuned via a grid search with cross-validation to arrive at the optimal set of hyperparameters used in developing the final XGBoost model.

2.6 Feature scaling and missingness

Feature scaling (also known as data normalization) is a process to standardize the variables present in a dataset to a constant scale [27]. Prior to the selection of variables, feature scaling was undertaken on the numeric variables using the Z-Score technique. The formula used for computing the Z-Score of a data point, x, was as follows:

$$\begin{aligned} \displaystyle x'=\frac{(x-\mu )}{\sigma }, \end{aligned}$$

where $\mu $ and $\sigma $ are the mean and standard deviation of the feature vector, respectively. The predictors were selected through principal component analysis (PCA), where the variables with a significant contribution to the first principal component were chosen. Since the position of a split point in XGBoost is unaffected by feature scaling, no scaling was applied to the chosen predictors for the XGBoost model [28]. As a preliminary step prior to performing PCA, the missing values in the life expectancy dataset were imputed with a PCA model via the missMDA package [29]. missMDA imputes the incomplete dataset in such a way that the imputed values carry no weight in the PCA results. The missing values were predicted using the iterative PCA algorithm, with two dimensions used to predict the missing entries.
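A minimal Python sketch of the Z-score transformation follows. The feature values are hypothetical, and the use of the sample standard deviation (divisor n − 1) is an assumption, since the text does not state which estimator was used. The last lines illustrate why tree-based splits are unaffected by scaling: the transformation preserves the ordering of the observations.

```python
import statistics


def z_scores(values):
    """Standardize a feature to mean 0 and (sample) standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)   # assumption: sample SD with divisor n - 1
    return [(v - mu) / sigma for v in values]


def rank_order(values):
    """Indices of the observations sorted from smallest to largest value."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical "years at school" values for five countries.
schooling_years = [12.1, 4.5, 13.7, 8.0, 10.2]
scaled = z_scores(schooling_years)

print(scaled)
# A tree split depends only on the ordering of a feature, which scaling keeps.
print(rank_order(schooling_years) == rank_order(scaled))
```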

2.7 eXtreme gradient boosting

eXtreme Gradient Boosting, abbreviated as XGBoost, is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework, proposed by Chen and Guestrin [15] in 2015. It is an algorithm for classification and regression problems that is frequently applied because of its speed, efficiency and scalability [30]. In gradient boosting, the function that determines the prediction for the ith row consists of the sum of the functions (trees) learned so far. Suppose there are K boosted trees; mathematically, the XGBoost model then takes the form:

$$\begin{aligned} \displaystyle {\hat{y}}_i=\sum _{k=1}^K f_k(x_i),\,\, f_k\in F, \end{aligned}$$

where ${\hat{y}}_i$ is the value predicted by the ML model for the ith row, K is the number of boosted trees, $x_i$ is the ith data point (a vector whose entries are the columns of the ith row), $f_k$ is a function in the functional space F, and F is the set of all possible Classification And Regression Trees (CARTs). Similarly, the prediction after the first k boosted trees ($k\le K$) is defined as:

$$\begin{aligned} \displaystyle {\hat{y}}_{i}^{(k)}=\sum _{j=1}^{k} f_j (x_i). \end{aligned} \tag{2}$$

The trees are trained by defining an objective function and optimizing it. The objective function to be optimized for the kth boosted tree is given by:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N L(y_i, {\hat{y}}_{i}^{(k)})+\sum _{j=1}^{k}\Omega (f_j), \end{aligned} \tag{3}$$

where $L(y_i, {\hat{y}}_{i}^{(k)})$ is the training loss (here, the squared error) after the kth boosted tree and $\Omega (f_j)$ is the regularization term of the jth tree, a penalty term to prevent over-fitting [16]. The loss and regularization functions are derived based on [15] to come up with the learning objective function. The general architecture of the XGBoost algorithm is shown in Fig. 1.

Since gradient-boosted trees sum the predictions of the previous trees, in addition to the prediction of the new tree, it follows from (2) that:

$$\begin{aligned} \displaystyle {\hat{y}}_{i}^{(k)}={\hat{y}}_{i}^{(k-1)}+ f_k(x_i), \end{aligned} \tag{4}$$

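which is the idea behind additive training. Substituting (4) into the learning objective in (3), and keeping only the regularization term of the new tree because the earlier trees are already fixed at step k, we obtain: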

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N L \left( y_i, {\hat{y}}_{i}^{(k-1)}+f_k(x_i)\right) +\Omega (f_k). \end{aligned}$$

With the squared-error loss, the above equation can be rewritten as follows:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ y_i-({\hat{y}}_{i}^{(k-1)}+f_k(x_i))\right] ^2+\Omega (f_k). \end{aligned}$$

Expanding the square, the objective function becomes:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ 2 \left( {\hat{y}}_{i}^{(k-1)}-y_i\right) f_k(x_i)+f_k(x_i)^2\right] +\Omega (f_k)+C, \end{aligned}$$

where C is a constant term that does not depend on $f_k$. The goal is to find the function $f_k$, mapping the samples (roots) to the predictions (leaves), that minimizes $Obj^{(k)}$. For a general loss function, XGBoost uses the Newton–Raphson method with a second-order Taylor expansion of the loss to obtain the following:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ g_if_k(x_i)+\frac{1}{2}h_if_k(x_i)^2\right] +\Omega (f_k), \end{aligned}$$

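where $g_i$ and $h_i$ are the first and second partial derivatives of the loss function with respect to the prediction. For the Bernoulli log-likelihood $l(y_i,{\hat{y}}_i)=y_i\log {\hat{y}}_i+(1-y_i)\log (1-{\hat{y}}_i)$ of a binary response, for example, they are: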

$$\begin{aligned} g_i&=\frac{\partial l(y_i, {\hat{y}}_{i})}{\partial {\hat{y}}_i}=\frac{y_i}{{\hat{y}}_i}-\frac{1-y_i}{1- {\hat{y}}_i}=\frac{y_i(1-{\hat{y}}_i)-{\hat{y}}_i(1-y_i)}{{\hat{y}}_i(1-{\hat{y}}_i)}\\ &= \frac{y_i-y_i {\hat{y}}_i-{\hat{y}}_i+y_i {\hat{y}}_i}{{\hat{y}}_i(1-{\hat{y}}_i)} = \frac{y_i-{\hat{y}}_i}{{\hat{y}}_i(1-{\hat{y}}_i)},\\ h_i&=\frac{\partial ^2 l(y_i,{\hat{y}}_i)}{\partial {\hat{y}}_i^2}=\frac{\partial }{\partial {\hat{y}}_i}g_i=\frac{\partial }{\partial {\hat{y}}_i}\left[ \frac{y_i-{\hat{y}}_i}{{\hat{y}}_i(1-{\hat{y}}_i)}\right] \\&=\frac{y_i-1}{{({\hat{y}}_i-1)}^2}-\frac{y_i}{{\hat{y}}_i^2}\,. \end{aligned}$$
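For the squared-error loss $L(y_i,{\hat{y}}_i)=(y_i-{\hat{y}}_i)^2$, the two derivatives take a much simpler form. The short derivation below is an illustrative sketch under that assumed loss, with the derivatives evaluated at the previous prediction ${\hat{y}}_i^{(k-1)}$:

$$\begin{aligned} g_i&=\frac{\partial }{\partial {\hat{y}}_i^{(k-1)}}\left( y_i-{\hat{y}}_i^{(k-1)}\right) ^2=2\left( {\hat{y}}_i^{(k-1)}-y_i\right) ,\\ h_i&=\frac{\partial ^2}{\partial \left( {\hat{y}}_i^{(k-1)}\right) ^2}\left( y_i-{\hat{y}}_i^{(k-1)}\right) ^2=2. \end{aligned}$$

In this case $g_if_k(x_i)+\frac{1}{2}h_if_k(x_i)^2=2\left( {\hat{y}}_i^{(k-1)}-y_i\right) f_k(x_i)+f_k(x_i)^2$, which is exactly the part of $\left[ y_i-\left( {\hat{y}}_i^{(k-1)}+f_k(x_i)\right) \right] ^2$ that depends on $f_k$, so the second-order expansion is exact for this loss.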

Having introduced the training step, we now define the complexity of the tree, $\Omega (f_k)$. Let w be the vector of leaf scores. Then $f_k$, the function mapping a data point from the root of the tree to one of its leaves, can be written in terms of w as follows:

$$\begin{aligned} \displaystyle f_k(x)=w_{q(x)}, \quad w\in R^T, \quad q:R^d\rightarrow \,\left\{ 1,2,\ldots ,T\right\} , \end{aligned}$$

where w is the vector of scores on leaves, q is the function assigning each data point to the corresponding leaf, and T is the number of leaves. In XGBoost, the regularization term is defined as shown in (5), where $\gamma $ and $\lambda $ are penalty constants to reduce overfitting [16]:

$$\begin{aligned} \displaystyle \Omega (f_k)=\gamma T +\frac{1}{2}\lambda \sum _{j=1}^T w_j^2. \end{aligned} \tag{5}$$

Combining the loss function with the regularization term yields the learning objective function as:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ g_i w_{q(x_i)}+\frac{1}{2}h_i w_{q(x_i)}^2\right] +\gamma T +\frac{1}{2}\lambda \sum _{j=1}^T w_j^2, \end{aligned}$$

which is the result XGBoost uses to determine how well the model fits the data.
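To make the additive training scheme concrete, the following Python sketch boosts one-split trees (stumps) on a single feature under the squared-error loss, for which $g_i = 2({\hat{y}}_i - y_i)$ and $h_i = 2$. Each leaf score is set to $w_j = -G_j/(H_j+\lambda )$, where $G_j$ and $H_j$ are the sums of $g_i$ and $h_i$ in leaf j; this is the standard minimizer of the objective above for a fixed tree structure, which the text does not state explicitly. The data, the restriction to stumps, the zero initial prediction and all function names are assumptions made for illustration; this is not the XGBoost library implementation.

```python
def best_stump(x, g, h, lam):
    """Return (threshold, w_left, w_right) for the best single split on x."""
    def leaf_score(G, H):
        return G * G / (H + lam)       # objective reduction offered by one leaf

    best = None
    cuts = sorted(set(x))
    for lower, upper in zip(cuts, cuts[1:]):
        thr = (lower + upper) / 2
        GL = sum(gi for xi, gi in zip(x, g) if xi <= thr)
        HL = sum(hi for xi, hi in zip(x, h) if xi <= thr)
        GR, HR = sum(g) - GL, sum(h) - HL
        score = leaf_score(GL, HL) + leaf_score(GR, HR)
        if best is None or score > best[0]:
            # Optimal leaf scores for this split: w = -G / (H + lambda).
            best = (score, thr, -GL / (HL + lam), -GR / (HR + lam))
    return best[1:]


def boost(x, y, n_trees=50, eta=0.3, lam=1.0):
    """Additive training: each new stump is fitted to the current gradients."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        g = [2.0 * (p - yi) for p, yi in zip(pred, y)]   # first derivatives
        h = [2.0] * len(y)                               # second derivatives
        thr, w_left, w_right = best_stump(x, g, h, lam)
        trees.append((thr, w_left, w_right))
        # y_hat^(k) = y_hat^(k-1) + eta * f_k(x), with eta as a shrinkage factor.
        pred = [p + eta * (w_left if xi <= thr else w_right)
                for p, xi in zip(pred, x)]
    return trees, pred


def mse(y, pred):
    return sum((yi - p) ** 2 for yi, p in zip(y, pred)) / len(y)


# Hypothetical data: years at school versus life expectancy.
x = [1, 2, 3, 4, 5, 6]
y = [50.0, 52.0, 55.0, 70.0, 72.0, 75.0]
trees, pred = boost(x, y)
print(mse(y, [0.0] * len(y)), mse(y, pred))
```

The penalty $\lambda $ shrinks every leaf score towards zero, and the shrinkage factor eta damps each step further, so the fit improves gradually over the boosting rounds rather than in a single step.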

3 Results

3.1 Overall findings

Results from the Extended Multivariate Kruskal–Wallis (E-MKW) rank sum test revealed significant differences in the mean vectors of the numeric variables across the regions (EM Kruskal–Wallis chi-squared = 2615.833, df = 220.4847, p-value ≈ 0). Similarly, there were significant differences in the variables across the three income groups (EM Kruskal–Wallis chi-squared = 1128.493, df = 73.49491, p-value = 1.448515e−188). The populations of the richest countries in the world have life expectancies of over 60 years, with the North American countries taking the lead with an average life expectancy of 79.88 years. The countries in the Sub-Saharan Africa region have the lowest life expectancy, with a mean of 57.12 years. The regional mean life expectancies are given in Table 4.

Table 4 Average life expectancy rates in years per region


3.2 Key determinants of life expectancy

Fig. 2a visualizes the contribution of study variables to the first principal component. Principal component one accounted for 32.6% of the variability in the dataset. The anticipated average contribution of the variables is shown by the red dashed line (6.25%).

If the research variables contributed equally, the expected average value depicted as the cutoff in Fig. 2a would be arrived at as shown in (6). A variable is deemed essential for contributing to a specific principal component if its contribution exceeds this threshold [31].

$$\begin{aligned} \displaystyle \text {Expected Average Contribution}\,(\%)&=\frac{1}{\text {No. of Variables}}\times 100\\ &=\frac{1}{16} \times 100=6.25\%. \end{aligned} \tag{6}$$

From Fig. 2a, the percent of thinness among children aged 5–9 and 10–19, number of years at school, average body mass index (BMI), and the numbers of under-five deaths and infant deaths per 1000 population were important in contributing to the first principal component. These six variables exceed the 6.25% threshold and are therefore considered important. The six variables, together with region and income group, were chosen as the research variables for further analysis. The importance of the variables in the final model, based on the gain measure, is depicted in Fig. 2b. Regional location, number of years at school, income group, number of under-five deaths per 1000 population, percent of thinness among children aged 5–9 and the average BMI are, in that order, the key determinants of life expectancy.
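The selection rule based on the expected average contribution can be sketched in a few lines of Python. The loadings below are hypothetical, and the sketch assumes that a variable’s contribution to a principal component is its squared loading expressed as a percentage of the sum of squared loadings, a common convention that the text does not spell out.

```python
def expected_average_contribution(n_variables):
    """Contribution (%) each variable would have if all contributed equally."""
    return 100.0 / n_variables


def contributions_from_loadings(loadings):
    """Assumed convention: squared loading as a share of the total, in percent."""
    total = sum(v * v for v in loadings.values())
    return {name: 100.0 * v * v / total for name, v in loadings.items()}


def important_variables(contributions):
    """Keep the variables whose contribution exceeds the equal-share threshold."""
    threshold = expected_average_contribution(len(contributions))
    return [name for name, c in contributions.items() if c > threshold]


print(expected_average_contribution(16))   # threshold for 16 variables

# Hypothetical loadings on the first principal component for four variables.
loadings = {"schooling": 0.8, "bmi": 0.5, "alcohol": 0.1, "population": 0.2}
contrib = contributions_from_loadings(loadings)
print(important_variables(contrib))
```

With four variables the threshold is 25%, so only the variables with a larger share of the component are retained.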

3.3 Association of life expectancy and its selected predictors

Fig. 3 visualizes the Spearman rank correlation matrix heatmap between life expectancy and the six significant predictors. Percent of thinness among children aged 5–9 and 10–19, the number of under five deaths, and the number of infant deaths are negatively associated with LE. The number of years spent in school and the average BMI are positively correlated with LE. The percent of thinness among children aged 5–9 ($t = -26.701$, df = 2654, p-value < 2.2e−16) and 10–19 ($t = -27.092$, df = 2654, p-value < 2.2e−16) were significantly correlated with LE.

Similarly, from Fig. 3, the number of years at school and life expectancy were significantly correlated ($t = 58.709$, df = 2654, p-value < 2.2e−16). Furthermore, the average BMI ($t = 35.733$, df = 2654, p-value < 2.2e−16), the number of under-five deaths ($t = -10.72$, df = 2654, p-value < 2.2e−16) and the number of infant deaths ($t = -9.3523$, df = 2654, p-value < 2.2e−16) per 1000 population were established to have a significant correlation with life expectancy. It is evident from Fig. 4a that life expectancy improves as the percent of thinness among children aged 5–9 years reduces.
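The reported t statistics are of the kind produced by the usual t approximation for testing a correlation coefficient, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. Assuming that this approximation was the one applied (the text does not say), the Python sketch below computes Spearman’s coefficient as the Pearson correlation of average ranks, derives the t statistic, and inverts the relation to recover r from a reported t and df. The data are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def t_statistic(r, n):
    """t approximation for a correlation r from n pairs (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def r_from_t(t, df):
    """Invert the relation above: r = t / sqrt(t^2 + df)."""
    return t / math.sqrt(t * t + df)


# Hypothetical data: years at school and life expectancy for six countries.
schooling = [4, 6, 8, 10, 12, 14]
life_exp = [55.0, 60.0, 58.0, 70.0, 72.0, 80.0]
rho = spearman(schooling, life_exp)
print(rho, t_statistic(rho, len(schooling)))
```

Under the same assumption, `r_from_t` can be applied to any of the reported (t, df) pairs to express the strength of the association as a correlation coefficient.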

The scatter plot in Fig. 4b depicts a general trend of increasing life expectancy rates with a decrease in the percent of thinness among children aged 10–19 years. Low income countries seem to have higher percentages of thinness among 10–19 year olds in comparison to their middle and high income counterparts. From Fig. 5a, it is evident that life expectancy increases with the number of years spent in school across all the seven regions.

In high income countries, the majority of the population spends more years in school than in middle and low income countries. Figure 5b suggests that life expectancy increases with the average body mass index of the entire population. The majority of the countries in Sub-Saharan Africa and South Asia have an average BMI within the healthy and underweight ranges. Most of the populations in the Europe and Central Asia, Latin America and Caribbean, and Middle East and North Africa regions are obese.

Notably from Fig. 6a, life expectancy improves as the number of under five deaths per 1000 population decreases. High income countries appear to have the lowest numbers of under-five deaths. Low income countries in the Sub-Saharan Africa and South Asian regions have higher numbers of under-five deaths.

It can further be observed from Fig. 6b that life expectancy increases with a decrease in the number of infant deaths per 1000 population. High income countries seem to experience the lowest numbers of infant mortality across all the regions. On the other hand, low income countries in the Sub-Saharan Africa and South Asian regions, and some of the middle income countries in East Asia and Pacific have higher numbers of infant mortality. As a general observation, life expectancy seems to be higher in the high income countries and lower in the low income countries across all the regions of the world.

3.4 Model performance results

The performance of the XGBoost model was compared with that of two other prominent machine learning models, Random Forest (RF) and Artificial Neural Networks (ANN), as shown in Table 5. For the RF model, the grid search optimization technique was used to test 10 different values for each of the following hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf, and max_features. The ANN model was defined with one hidden layer consisting of 20 neurons; the input layer took in 8 predictor variables and the output layer consisted of one neuron. The model was compiled using mean squared error (MSE) as the loss function, root mean square propagation (RMSprop) as the optimizer with a learning rate of 0.005, and mean absolute error (MAE) as the performance metric. The model was trained for 100 epochs with a batch size of 32 and a validation split of 20%.
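To give a sense of the size of the ANN described above, the sketch below counts its trainable parameters. It assumes fully connected layers with one bias term per neuron, which the text does not state explicitly; the function name is illustrative.

```python
def dense_parameters(n_inputs, n_units):
    """Weights plus biases of one fully connected layer (assumed architecture)."""
    return n_inputs * n_units + n_units


hidden_layer = dense_parameters(8, 20)   # 8 predictors feeding 20 hidden neurons
output_layer = dense_parameters(20, 1)   # 20 hidden neurons feeding 1 output
print(hidden_layer, output_layer, hidden_layer + output_layer)
```

Under these assumptions the network has 201 trainable parameters, which is small relative to the size of the training set.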

RMSprop was used as the optimizer since it is known to improve the convergence rate and generalization performance of the model. The XGBoost model outperformed both the ANN and RF models. The XGBoost and RF models were efficient in terms of training run-time, whereas the ANN model performed poorly on this front.

Table 5 RF, ANN & XGBoost models’ predictive performance results


Figs. 7a and 7b compare the predicted and actual values of life expectancy and of mean life expectancy for the year 2015, respectively, and show little deviation between the two.

4 Discussion

The XGBoost model was built on 80% of the dataset. The baseline model was created with the default hyperparameter values: the gbtree booster was used, with tree booster parameter values of 0.3 for the learning rate (eta), 6 for the maximum tree depth (max_depth), 0 for gamma, 1 for the minimum child weight (min_child_weight), 1 for the row subsample ratio (subsample), and 1 for the column subsample ratio (colsample_bytree). Regression with the squared loss objective (reg:squarederror) was employed for the learning task parameter. A total of 100 trees were included in the model. The grid search optimization method was used to test 10 different values for each of the following XGBoost hyperparameters: eta, gamma, max_depth, min_child_weight, subsample, colsample_bytree, and nrounds.

After testing all possible combinations of these hyperparameters, the final set of hyperparameters that resulted in the highest accuracy and lowest error rate on the validation set was determined to be: eta = 0.3, gamma = 0, max_depth = 7, min_child_weight = 1, subsample = 1, colsample_bytree = 1, and nrounds = 500. These hyperparameters were then used to train the final XGBoost model in this study. The use of these optimized hyperparameters is believed to have improved the accuracy and reliability of the results. A 10-fold cross-validation with two partitioning repeats yielded the best bias-variance trade-off. For model training efficiency, 6 cores were used to run the cross-validation in parallel. The remaining 20% of the dataset was used to evaluate the model, and the model accuracy was assessed using RMSE and MAE.
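The tuning procedure combines two ingredients: a grid of candidate hyperparameter settings and a repeated k-fold partition of the training rows on which each setting is scored. The Python sketch below generates both. The candidate values in the grid are hypothetical and much smaller than the grid described above, the function names are illustrative, and the model fitting step is left out.

```python
import itertools
import random


def repeated_kfold(n_rows, k=10, repeats=2, seed=1):
    """Yield (train_indices, validation_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)                      # a fresh partition per repeat
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            validation = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
            yield train, validation


# Hypothetical candidate values; the study's actual grid is not reproduced here.
grid = {"eta": [0.1, 0.3], "max_depth": [6, 7], "nrounds": [100, 500]}
settings = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

splits = list(repeated_kfold(n_rows=100))
print(len(settings), len(splits))   # number of settings, fits needed per setting
```

With 10 folds and two repeats, every candidate setting is fitted and scored 20 times, which is why running the folds in parallel shortens the search considerably.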

Regional location, number of years at school, income group, number of under-five fatalities per 1000 population, and the percent of thinness among children aged 5–9 years were shown to be the major drivers of life expectancy in this study. These findings are consistent with earlier research by Kaplan et al. [32] and Luy et al. [33] who established that rising educational levels were a determinant of rising life expectancy. According to Szwarcwald et al. [34], life expectancy at birth for women and men living in the wealthiest regions was 5 years higher than for those living in the poorest regions. Nestorovska and Levkov [35] discovered that gross national income had a favorable and statistically significant influence on life expectancy. Miladinov [36] established that a country’s population health and socioeconomic development had a significant impact on life expectancy at birth.

Across all regions, the current study finds that life expectancy is higher in high-income nations and lower in low-income ones. These findings are explained by the fact that nations in the richest geographical areas, such as North America, have longer life expectancies. This is due to increased economic growth in such regions, together with better health care, social well-being, industrialization, and educational attainment levels. Across all seven regions, life expectancy rises as the number of years spent in school increases. Compared with middle- and low-income nations, high-income countries have a majority of their inhabitants spending more years in school. A high level of literacy allows people to make better life decisions and gives them more job prospects, more negotiating power over remuneration, and healthier food and lifestyle habits, to name a few benefits. This raises the standard of living of a country’s citizens.

As a metric of economic development, higher GDP per capita boosts life expectancy at birth by promoting economic growth and development in a country. High-income nations outperform their middle- and low-income peers in terms of economic growth and development. This is due to increasing economic growth, higher living standards, and improved health in first-world countries. Life expectancy rises as the number of under-five deaths per 1000 population falls. High-income nations tend to have the fewest deaths among children under the age of five. This is due to better access to healthcare in advanced economies, particularly prenatal and postnatal care, as well as to dietary factors. As a result, the risk of death for children under the age of five is quite low in these nations. Low-income nations, on the other hand, have a higher rate of mortality among children under the age of five. The highest estimates are seen in Sub-Saharan Africa and South Asia. This is due to low-income nations’ lag in terms of economic growth, employment rates, access to better healthcare, improved sanitation, and overall living standards.

As the percentage of thinness among children aged 5–9 years decreases, life expectancy increases. High-income countries have lower percentages of thin children in this age group, with North American countries having the lowest. Low-income countries, on the other hand, continue to have higher rates of childhood thinness, with Sub-Saharan Africa and South Asia topping the list. Thinness is connected to medical, societal, and economic difficulties [37]. Poor nutrition, caused by insufficient food and beverage consumption, limited availability of food and drink, chronic famine, and food insecurity, is a key contributor to thinness, particularly in Sub-Saharan Africa and South Asia. In high-income countries, thinness may instead be explained by the ambition of young girls to attain the thin beauty ideal promoted by the fashion modelling industry [38].

The study by Pisal et al. [39] predicts life expectancy in the Asian population using tree classifier models, namely J48, Random Tree, and Random Forest. In contrast, the present paper employs an XGBoost regressor to estimate life expectancy globally. Pisal et al. [39] report that the Random Forest model achieves the highest accuracy, 84%, under 10-fold cross-validation. Their study also identifies significant predictors of life expectancy, including socioeconomic factors, educational status, health conditions, and infectious diseases. Chen and Cheng [40] proposed a linear mean residual life model and developed inference procedures for handling potential censoring. They conducted simulations to assess the finite sample properties of the proposed methods. However, the paper gives few details illustrating the efficiency of the proposed approaches. While the linear mean residual life model is a valuable tool in certain contexts, it may not always be appropriate for the data at hand.

While the study by Shang [41] presents useful insights into the prediction of life expectancy using model averaging approaches, its limitations include a lack of information about the actual model evaluation metric values used and a comparative investigation limited to principal component approaches and univariate methods. The study also highlights the challenges of accurately predicting life expectancy, especially in different demographic groups and regions. Raftery et al. [11] propose a Bayesian hierarchical model (BHM) to probabilistically project life expectancy for all countries worldwide, comparing its performance with the UN methodology. The BHM achieved a mean absolute error (MAE) and standard absolute prediction error (SAPE) of 1.07 and 1.04, respectively, outperforming the UN approach, which reported an MAE of 1.86 for 10-year out-of-sample predictions. However, this study has a number of limitations. Firstly, it excludes countries with a generalized HIV/AIDS epidemic, potentially affecting the generalizability of the findings. Secondly, the analysis focuses solely on forecasting life expectancy for males, limiting the ability to assess the model’s robustness. Additionally, the Bayesian approach employed in this study requires a prior distribution, which can be a disadvantage when external information is scarce.

The paper presented by Dias et al. [42] investigated the impact of sex, death cause, profession, and race on life expectancy in the Colombo district of Sri Lanka using generalized linear models (GLMs) and Kaplan-Meier estimates. The authors found that a univariate GLM could predict an individual’s lifetime, and that race, cause of death, and profession had no effect on lifespan. However, the corrected GLM had poor fit with low $R^2$ and adjusted $R^2$ values, indicating underfitting. The study was limited by its narrow focus on predictors and the need for a more robust modeling technique. In contrast, the current study aimed to broaden the scope of predictors and improve the accuracy as well as the generalizability of life expectancy predictions by incorporating other relevant factors, while employing a robust algorithm.

The present study demonstrates the competitiveness of the XGBoost algorithm in its ability to handle missing data, build the model quickly, and achieve better accuracy than other comparable ensemble algorithms applied to life expectancy datasets. On the test set, the XGBoost model attained MAE and RMSE values of 1.554 and 2.402, respectively. The optimal number of iterations was 500, with a learning rate of 0.3 and a maximum tree depth of 7. These results improve on those published by Meshram [12], who used a random forest regression model and achieved an MAE of 1.58. They support the XGBoost regressor as a reliable and efficient model for estimating life expectancy across the globe.
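To put these error values side by side, the relative reduction in error can be computed directly. The sketch below uses the XGBoost test-set values reported above, the RF and ANN values stated in the abstract, and the MAE reported by Meshram [12]; the helper `pct_reduction` and the dictionary layout are illustrative choices, not part of the study.

```python
def pct_reduction(reference, candidate):
    """Percentage by which `candidate` error is lower than `reference` error."""
    return 100.0 * (reference - candidate) / reference

# Test-set errors reported for the XGBoost model.
xgb = {"MAE": 1.554, "RMSE": 2.402}

# Comparator errors as stated in the abstract and in the discussion of [12].
comparators = {
    "RF (this study)": {"MAE": 7.938, "RMSE": 11.304},
    "ANN (this study)": {"MAE": 3.86, "RMSE": 5.002},
    "RF (Meshram [12])": {"MAE": 1.58},
}

for name, metrics in comparators.items():
    for metric, value in metrics.items():
        gain = pct_reduction(value, xgb[metric])
        print(f"{name:18s} {metric}: {gain:.1f}% lower with XGBoost")
```

Relative to the MAE of 1.58 in [12], the reduction is about 1.6%. That comparison is only indicative, because the two studies may differ in data and in how the test set was drawn.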

5 Conclusion

Beyond the research variables employed in this work, this study recognizes that other measures of health and environmental conditions also contribute to a population’s lifespan. Quality-adjusted life years (QALYs), disability-adjusted life years (DALYs), as well as pollution and climate change have an influence on life expectancy. However, their inclusion in the model was beyond the scope of this study. In addition, data for all the factors included in this analysis were not easily obtainable for the years 2016 through 2021. As a result, the prediction model does not account for recent events that may have affected life expectancy in the UN member nations.

The results presented in this paper are by no means the best representation of life expectancy. Despite the shortcomings highlighted above, the work presents some interesting findings that are comparable to research done by others. Moreover, the results of the mathematical model can assist in the design of better life expectancy prediction strategies that lead to improvements in people’s well-being. Future studies may integrate other quality of life measures (such as QALYs and DALYs) and environmental components into the prediction model, in addition to updating the model with fresh data.

Availability of data and materials

The data used to support the findings of this study were obtained from Global Health Observatory (GHO), a public database accessible from GHO [43]. The data are open source with no restriction and have been deposited and archived in Dryad repository available at Omondi et al. [44]. The codes are available online at Omondi et al. [45, 46].

References

OECD (2022) Health status: life expectancy at birth—OECD data, March 2022. https://data.oecd.org/healthstat/life-expectancy-at-birth.htm
Roser M, Ortiz-Ospina E, Ritchie H (2013) Life expectancy, May 2013. https://ourworldindata.org/life-expectancy
World Health Organization (2021) World health statistics 2021: monitoring health for the SDGs, sustainable development goals. The Global Health Observatory, pp 1–121. https://apps.who.int/iris/bitstream/handle/10665/342703/9789240027053-eng.pdf
Global Goals (2022) The global goals, February 2022. https://www.globalgoals.org/
UN (2021) The sustainable development goals report. https://unstats.un.org/sdgs/report/2021/The-Sustainable-Development-Goals-Report-2021.pdf
Ho JY, Hendi AS (2018) Recent trends in life expectancy across high income countries: retrospective observational study. BMJ 362:k2562
Wang H, Naghavi M, Allen C, Barber RM, Bhutta ZA, Carter A, Casey DC, Charlson FJ, Chen AZ, Coates MM et al (2016) Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980–2015: a systematic analysis for the global burden of disease study 2015. The Lancet 388(10053):1459–1544
Ayuso M, Bravo JM, Holzmann R (2021) Getting life expectancy estimates right for pension policy: period versus cohort approach. J Pens Econ Financ 20(2):212–231
Wunsch G, Mouchart M, Duchene J (2002) The life table: modelling survival and death. In: European studies of population, vol 11, 1 edn. Springer, The Netherlands. ISBN 978-90-481-6025-9, 978-94-017-3381-6. http://gen.lib.rus.ec/book/index.php?md5=85a62a75bf973ae5d16ad2cfe707a237
Anderson S, Auquier A, Hauck WW, Oakes D, Vandaele W, Weisberg HI (1980) Statistical methods for comparative studies. Chichester, Brisbane, New York
Raftery AE, Chunn JL, Gerland P, Ševčíková H (2013) Bayesian probabilistic projections of life expectancy for all countries. Demography 50(3):777–801
Meshram SS (2020) Comparative analysis of life expectancy between developed and developing countries using machine learning. In: 2020 IEEE Bombay section signature conference (IBSSC). IEEE, pp 6–10
Lesnussa YA, Rumlawang FY, Risamasu E, Fhilya C (2020) Prediction of life expectancy in Maluku province using artificial neural networks backpropagation. J Mat Integr 16(2):75–82
Donges N (2021) A complete guide to the random forest algorithm, July 2021. https://builtin.com/data-science/random-forest-algorithm#procon
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785–794
Wade C (2020) Hands-on gradient boosting with XGBoost and scikit-learn: perform accessible machine learning and extreme gradient boosting with Python. Packt Publishing
Joseph VR (2022) Optimal ratio for data splitting. Stat Anal Data Min: ASA Data Sci J 15(4):531–538
Wang M-X, Huang D, Wang G, Li D-Q (2020) SS-XGBoost: a machine learning framework for predicting newmark sliding displacements of slopes. J Geotech Geoenviron Eng 146(9):04020074
Kuhn M (2021) caret: classification and regression training. R package version 6.0-88. https://CRAN.R-project.org/package=caret
R Core Team (2022) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
Bakas I, Kontoleon KJ (2021) Performance evaluation of artificial neural networks (ANN) predicting heat transfer through masonry walls exposed to fire. Appl Sci 11(23):11435
Minaee S (2019) An introduction to the most important metrics for evaluating classification, regression, ranking, vision, NLP, and deep learning models: part 1-classification and regression evaluation metrics. Towards Data Sci. https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce
Kassambara A (2018) Machine learning essentials: practical guide in R. STHDA
Sedgwick P (2014) Spearman’s rank correlation coefficient. BMJ 349:g7327
Korkmaz S, Goksuluk D, Zararsiz G (2014) MVN: an R package for assessing multivariate normality. R J 6(2):151–162
Fanyin H, Mazumdar S, Tang G, Bhatia T, Anderson SJ, Dew MA, Krafty R, Nimgaonkar V, Deshpande S, Hall M et al (2017) Non-parametric MANOVA approaches for non-normal multivariate outcomes with missing values. Commun Stat-Theory Methods 46(14):7188–7200
Mamidanna SK, Reddy CR, Gujju A (2022) Detecting an insider threat and analysis of XGBoost using hyperparameter tuning. In: 2022 International conference on advances in computing, communication and applied informatics (ACCAI). IEEE, pp 1–10
Sun X (2021) Application and comparison of artificial neural networks and XGBoost on Alzheimer’s disease. In: Proceedings of the 2021 international conference on bioinformatics and intelligent computing, pp 101–105
Josse J, Husson F (2016) missMDA: a package for handling missing values in multivariate data analysis. J Stat Softw 70(1):1–31. https://doi.org/10.18637/jss.v070.i01
Wang Y, Ni XS (2019) A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization. arXiv preprint arXiv:1901.08433
Kassambara A (2017) Practical guide to principal component methods in R: PCA, M (CA), FAMD, MFA, HCPC, factoextra, vol 2. STHDA
Kaplan R, Spittel M, Zeno T (2014) Educational attainment and life expectancy. Policy Insights Behav Brain Sci 1:189–194, 10. https://doi.org/10.1177/2372732214549754
Luy M, Zannella M, Wegner-Siegmundt C, Minagawa Y, Lutz W, Caselli G (2019) The impact of increasing education levels on rising life expectancy: a decomposition analysis for Italy, Denmark, and the USA. Genus 75(1):1–21
Szwarcwald CL, de Souza Júnior PRB, Marques AP, da Silva de Almeida W, Montilla DER (2016) Inequalities in healthy life expectancy by Brazilian geographic regions: findings from the National Health Survey, 2013. Int J Equity Health 15(1):1–9
Nestorovska MT, Levkov N (2019) Determinants of life expectancy: analysis of southeastern European countries. Knowl Int J 31:07. https://doi.org/10.35120/kij3101193t
Miladinov G (2020) Socioeconomic development and life expectancy relationship: evidence from the EU accession candidate countries. Genus 76(1):1–20
Suder A, Jagielski P, Piórecka B, Płonka M, Makiel K, Siwek M, Wronka I, Janusz M (2020) Prevalence and factors associated with thinness in rural Polish children. Int J Environ Res Public Health 17(7):2368
Tambalis KD, Panagiotakos DB, Psarra G, Sidossis LS (2019) Prevalence, trends and risk factors of thinness among Greek children and adolescents. J Prev Med Hyg 60(4):E386
Pisal NS, Abdul-Rahman S, Hanafiah M, Kamarudin SI (2022) Prediction of life expectancy for Asian population using machine learning algorithms. Malays J Comput 7(2):1150–1161
Chen YQ, Cheng S (2006) Linear life expectancy regression with censored data. Biometrika 93(2):303–313
Shang HL (2012) Point and interval forecasts of age-specific life expectancies: a model averaging approach. Demogr Res 27:593–644
Dias N, Sucharitharathna C et al (2017) Prediction of life expectancy. Am Sci Res J Eng, Technol, Sci (ASRJETS) 34(1):252–260
GHO (2022) Global Health Observatory data repository. Life expectancy and Healthy life expectancy data by country. https://apps.who.int/gho/data/view.main.SDG2016LEXv?lang=en
Omondi et al. (2022) A machine learning based prediction model for life expectancy, Dryad, Dataset. https://doi.org/10.5061/dryad.z612jm6fv
Omondi et al. (2022) A machine learning based prediction model for life expectancy, Dryad, Dataset. https://datadryad.org/stash/share/vKcd-rPCur8y_VKFHrjKPpD88mHdxGoJdBGkN9_3M3Y
Omondi et al. (2022) A machine learning based prediction model for life expectancy, Dryad, Dataset. https://zenodo.org/record/7319734

Acknowledgements

The authors acknowledge, with thanks, the support of the Institute of Mathematical Sciences, Strathmore University, in the production of this manuscript.

Funding

The authors received no direct funding for this research.

Author information

Authors and Affiliations

Institute of Mathematical Sciences, Strathmore University, P.O Box 59857-00200, Nairobi, Kenya
Brian Aholi Lipesa, Elphas Okango, Bernard Oguna Omolo & Evans Otieno Omondi
Division of Mathematics and Computer Science, University of South Carolina-Upstate, 800 University Way, Spartanburg, SC, 29303, USA
Bernard Oguna Omolo

Contributions

B. A. L.: conceptualization, data curation, formal analysis, writing original draft and writing-review and editing; E. O.: methodology, formal analysis, supervision and writing-review and editing; B. O. O.: formal analysis, methodology, supervision and writing-review and editing; E. O. O.: methodology, formal analysis, supervision and writing-review and editing. All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Corresponding author

Correspondence to Evans Otieno Omondi.

Ethics declarations

Conflict of interest declaration

The authors declare that they have no competing interests.

Ethical approval and consent to participate

The data used did not have any personal identification data and the Strathmore University Ethics Review Committee approval was obtained.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lipesa, B.A., Okango, E., Omolo, B.O. et al. An application of a supervised machine learning model for predicting life expectancy. SN Appl. Sci. 5, 189 (2023). https://doi.org/10.1007/s42452-023-05404-w

Received: 11 March 2023
Accepted: 30 May 2023
Published: 16 June 2023
DOI: https://doi.org/10.1007/s42452-023-05404-w
