1 Introduction

The Organisation for Economic Co-operation and Development (OECD) [1] defines life expectancy (LE) at birth as how long, on average, a newborn can expect to live if current death rates remain constant. More generally, LE is an estimate of the average number of additional years that a person of a given age can expect to live; LE at birth therefore corresponds to the average age at death in a population. Estimates suggest that in ancient times, LE was around 30 years in all regions of the world [2]. Prior to the COVID-19 pandemic, population health was improving globally, with the global average LE at birth increasing from 66.8 years in 2000 to 73.3 years in 2019. In 2019, LE reached 70.9 years for males and 75.9 years for females. The African Region still had the lowest LE among WHO regions in 2019, at only 64.5 years, despite experiencing the largest gains over the past two decades. The Region of the Americas had the highest LE (74.1 years) in 2000 but dropped to third place (77.2 years) in 2019 as the European and Western Pacific Regions made accelerated gains, reaching 78.2 and 77.7 years, respectively [3].

At the UN Sustainable Development Summit in September 2015, world leaders adopted the Sustainable Development Goals (SDGs) as the global agenda for development through 2030. These goals are a set of 17 commitments made by the 193 world leaders for fair and sustainable health at every level. Their aim is to end extreme poverty, inequality and climate change by the year 2030 [4]. Ensuring healthy lives and promoting well-being for all at all ages is vital to sustainable development [5]. Life expectancy is a key measure often used to assess the overall well-being and health of a population. It is an indicator of a country’s overall health, reflecting its social and economic conditions, as well as the quality of its public health and healthcare infrastructure, among other factors [6]. Wang et al. [7] note that the fundamental goal of a health system is to prolong healthy life into advanced age.

A slight growth in LE translates into sizeable increases in the population. Lengthening a population’s LE tends to increase the number of individuals living with chronic illnesses, which are common among the elderly. Thus, the ability to project how populations will age has major implications for a nation’s planning, support and service provision. The computation of life expectancy involves building an ordinary life table [8]. The life table is a population model covering the simple case of a birth cohort, born in the same time period, closed to migration, and followed through successive ages until all its members die [9]. The life table assumes the homogeneity of cohorts (i.e., subjects have the same distribution of survival times) [10]. In life tables, the LE of persons alive at age x is computed as:

$$\begin{aligned} \displaystyle \text {e}_x=\frac{T_x}{l_x}, \end{aligned}$$

where \(T_x\) is the total number of person-years lived by the cohort from age x until all members of the cohort have died and \(l_x\) is the number of persons alive at age x [9]. Because a cohort life table requires following a birth cohort through successive ages until all of its members have died, the resulting estimates may no longer be applicable to current populations, whose LE is shaped by recent technological and other changes. Moreover, the homogeneity assumption of life tables is rarely realistic in practice.
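As a minimal sketch of the life-table formula above, the following Python snippet computes \(e_x=T_x/l_x\) for a toy cohort. The function name `life_expectancy`, the input layout and all numbers are illustrative assumptions and are not taken from the study; \(T_x\) is assumed to be obtained by accumulating the person-years lived in each age interval from the oldest age downwards.

```python
def life_expectancy(l, L):
    """Return e_x = T_x / l_x for every age index x.

    l[x]: number of persons alive at exact age x (l_x).
    L[x]: person-years lived by the cohort in the age interval starting at x.
    T_x is the sum of L[a] over all ages a >= x.
    """
    if len(l) != len(L):
        raise ValueError("l and L must have the same length")
    T = [0.0] * len(l)
    running = 0.0
    # Accumulate person-years from the oldest age interval downwards.
    for x in range(len(l) - 1, -1, -1):
        running += L[x]
        T[x] = running
    return [T[x] / l[x] for x in range(len(l))]


# Toy cohort with made-up numbers: 100 births, nobody survives past age 4.
l_x = [100, 90, 60, 20]
L_x = [95.0, 75.0, 40.0, 10.0]
e_x = life_expectancy(l_x, L_x)
```

Here `e_x[0]` plays the role of LE at birth for the toy cohort, and later entries give the remaining LE at each subsequent age.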

Furthermore, in practice, individuals are subject to censoring through death or out-migration, which biases the estimates obtained from life tables. As a result, a more robust and accurate approach is needed. Raftery et al. [11] developed a Bayesian Hierarchical Model (BHM) for probabilistic life expectancy forecasts for all nations across the world. The model’s performance was compared with the UN methodology. The 10-year out-of-sample predictions of the BHM attained a mean absolute error (MAE) and a standard absolute prediction error (SAPE) of 1.07 and 1.04, respectively, outperforming the UN approach, which had reported an MAE of 1.86. Meshram [12] adopted the random forest (RF) regressor to compare life expectancy between advanced and developing economies, achieving MSE and MAE values of 4.43 and 1.58, respectively. The back-propagation artificial neural network (ANN) technique was employed in [13] to forecast life expectancy in Maluku province, Indonesia, with a reported mean absolute percentage error (MAPE) of 0.0035 and an average accuracy of 99.65%. The three approaches discussed above have the following drawbacks:

  (a) The Bayesian approach requires a prior distribution even when very limited prior information is available.

  (b) The random forest regressor may be unreliable and slower for real-time predictions in settings with high number of trees [14].

  (c) The back-propagation ANN model was evaluated based on 5 and 2 training and testing samples, respectively. This amount of data is considerably small for training neural networks.

This study proposes the use of the eXtreme Gradient Boosting (XGBoost) algorithm to predict life expectancy, considering health, socioeconomic and behavioral factors. Specifically, the study sought to: (a) identify and select the predictors considered important in contributing to life expectancy; and (b) assess the predictive performance of the developed model based on life expectancy data. XGBoost has been widely used to produce state-of-the-art results on many machine learning problems [15]. XGBoost is competitive in terms of its ability to handle missingness, build machine learning models more quickly, and apply built-in regularization to achieve accuracy gains [16]. These features make XGBoost a faster, more accurate and more desirable algorithm than other comparable ensemble algorithms for the dataset at hand.

The remaining sections of this paper are structured as follows: Sect. 2 presents the methods and materials employed, including a description of the dataset used in this study. Section 3 details the findings from this study and compares the proposed life expectancy prediction model with the RF and ANN models used in earlier research. Section 4 discusses the prediction model and experimental outcomes in relation to existing research methodologies. Section 5 wraps up the paper with some suggestions for future research.

2 Methods and materials

2.1 Data source and study variables

The data for this study were sourced from the databases of the World Health Organization (WHO) and the United Nations (UN). The data covered 193 UN member states for the years 2000–2015. The health-related factors were drawn from the Global Health Observatory data repository, while the UN data repository provided the corresponding socioeconomic factors for the 193 countries. After merging the individual data files, the resultant dataset contained 2937 observations of 22 variables. However, during the data pre-processing phase, it was discovered that the values of the “population” variable were incorrect, and they were replaced with new values from the World Bank Development Indicators’ dataset. Furthermore, the “percentage expenditure on health as per GDP” variable was found to have erroneous values and was excluded from the study. Due to missing data for a considerable number of years, the data for Niue, San Marino, Cook Islands, Marshall Islands, Monaco, Palau, Tuvalu, and Dominica were excluded from this study. After data cleaning, the final dataset used in this study consisted of 2832 observations of 21 variables.

These data cleaning measures were taken to ensure the accuracy and reliability of the dataset used in our study. Tables 1 and 2 present the description of the data in terms of its summary statistics and frequencies, respectively. The detailed definitions of the study variables are presented in Table 3.

Table 1 Summary statistics for continuous variables
Table 2 Summary frequencies for categorical variables

2.2 Definition of variables

2.3 Data splitting and model validation metrics

Data splitting is a routinely employed technique for model validation in which a given dataset is partitioned into two disjoint sets: training and testing. The proposed model in this study was applied to a real-life dataset. The dataset was randomly split into training and testing sets following the commonly applied ratio of 80% and 20%, respectively [17, 18]. The training set was used for model development and the hold-out (testing) set for model validation. To avoid introducing selection bias, the split was performed by automatic random selection using the caret package [19] in R. R version 4.2.0 [20] was the primary software for all statistical computations and analyses.
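The 80/20 random split described above can be sketched as follows. This is an illustrative Python sketch using a plain random shuffle, not the caret routine used in the study (caret's partitioning may additionally balance the split on the outcome); the function name `train_test_split`, the seed and the stand-in row identifiers are assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Randomly partition rows into disjoint training and testing lists."""
    indices = list(range(len(rows)))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Stand-in row identifiers; 2832 matches the number of cleaned observations.
rows = list(range(2832))
train, test = train_test_split(rows)
```

Fixing the seed makes the partition reproducible, which matters when several models are compared on the same hold-out set.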

Table 3 Definition of the study variables

As Bakas and Kontoleon [21] recommend, once the model had been developed, trained and found to be functional, its performance was evaluated before proposing its use in industry or further scientific research. Because they are widely used for measuring the accuracy of continuous predictions and are quick and easy to calculate, the study employed the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) to examine the accuracy of the resultant prediction model. RMSE is sensitive to outliers, since squaring the errors gives extreme values (which usually have larger errors than other samples) more weight in the final error, thereby influencing the fitted model parameters [22]. MAE is known to be more robust to outliers [23]. To address this limitation of RMSE, the study used the MAE as a complementary measure for model validation.

RMSE essentially finds the square root of the average squared error between the actual value and the predicted value by the model. It is a standard way to measure the error of a model in numerical predictions. It is defined as:

$$\begin{aligned} \displaystyle \text {RMSE}=\displaystyle \sqrt{\frac{1}{n}\sum _{i=1}^n \left( y_i-{\hat{y}}_i\right) ^2} \end{aligned}$$
(1)

where \(y_i\) is the ith observed value of the response, \({\hat{y}}_i\) is the corresponding ith value predicted by the fitted model from the predictors in the data, and n is the number of observations. Lower RMSE values indicate a better fit [23].

Mean Absolute Error (MAE) (or the mean absolute deviation) finds the average absolute distance between the target value and the predicted value. It is defined as:

$$\begin{aligned} \displaystyle \text {MAE}=\frac{1}{n}\left[ \sum _{i=1}^n \left| y_i-{\hat{y}}_i\right| \right] \end{aligned}$$

where \(y_i\), \({\hat{y}}_i\) and n are as defined for (1), and \(\left| y_i-{\hat{y}}_i\right| \) is the absolute error. A lower MAE value indicates a better fit.
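The two metrics defined above can be computed as in the short Python sketch below. The function names and the four toy observations are illustrative assumptions; the example is chosen so that a single large error inflates the RMSE more than the MAE.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


# Made-up life expectancies (years); the last prediction is off by 4 years.
y_obs = [70.0, 65.0, 80.0, 55.0]
y_fit = [71.0, 65.0, 78.0, 59.0]
rmse_value = rmse(y_obs, y_fit)
mae_value = mae(y_obs, y_fit)
```

In this toy case the RMSE exceeds the MAE because the squared 4-year error dominates the sum, which illustrates why the two measures are reported together.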

2.4 Study design and population

This study followed a cross-sectional design in which information on 21 study variables for 193 UN member states from 2000 to 2015 was obtained. The study population comprised the UN member states that had data for all the years from 2000 through 2015. Countries with no data at all for any particular year were excluded from the study. As of 2022, according to Worldatlas, there were 195 countries in the world. The study population was based on the readily available data for the 193 countries, which represent nearly the entire globe. The data were partitioned according to the 80:20 split ratio described in Sect. 2.3.

2.5 Statistical analysis

To convert the raw data into a meaningful and efficient format, data pre-processing was undertaken in the preliminary phase of the analysis. The data were inspected for missingness and for the structure of the variables, and cleaned appropriately. Summary statistics were employed to establish the measures of central tendency. Exploratory data analysis (EDA) was applied to understand the underlying structure of the dataset and uncover insights relevant to this study. Spearman’s rank correlation coefficient was used to measure the associations between study variables as explained in [24], and the resultant correlation matrix was visualized through a heatmap via the corrplot package in R. A multivariate normality check was carried out using the MVN package [25] to ascertain whether the data were drawn from a normally distributed population.

The multivariate normality assumption was violated. Thus, whether the Regions and Income Groups had different mean vectors across the study variables was assessed using the Extended Multivariate Kruskal–Wallis (E-MKW) test with missing data proposed in [26]. Data for the year 2015 for all the countries were reserved for making predictions. The remaining data were split into two disjoint partitions, the train and test sets, using the caret package [19]. The baseline model was initialized using the default XGBoost hyperparameters. Subsequently, the hyperparameters were fine-tuned via grid search with cross-validation to arrive at the optimal set of hyperparameters used in developing the final XGBoost model.

2.6 Feature scaling and missingness

Feature scaling (also known as data normalization) is a process to standardize the variables present in a dataset to a constant scale [27]. Prior to the selection of variables, feature scaling was undertaken on the numeric variables using the Z-Score technique. The formula used for computing the Z-Score of a data point, x, was as follows:

$$\begin{aligned} \displaystyle x'=\frac{(x-\mu )}{\sigma }, \end{aligned}$$

where \(\mu \) and \(\sigma \) are the mean and standard deviation of the feature vector, respectively. The predictors were selected through principal component analysis (PCA): the variables with a significant contribution to the first principal component were chosen. Since the position of a split point in XGBoost is unaffected by feature scaling, no scaling was applied to the chosen predictors for the XGBoost model [28]. As a preliminary step prior to performing PCA, the missing values in the life expectancy dataset were imputed with a PCA model via the missMDA package [29]. missMDA imputes the incomplete dataset in such a way that the imputed values carry no weight in the results of the PCA. The missing values were predicted using the iterative PCA algorithm, with two dimensions used to predict the missing entries.
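The Z-score formula above can be sketched in a few lines of Python. The text does not state whether the sample or the population standard deviation was used; the sketch assumes the sample standard deviation (the convention of R's `scale()`), and the function name and toy values are illustrative.

```python
import statistics


def z_score(values):
    """Standardize a numeric feature: subtract the mean, divide by the sd."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: sample sd (n - 1 denominator)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]


# Made-up feature values.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_score(feature)
```

After scaling, the feature has mean 0 and unit sample standard deviation, so variables measured in different units contribute comparably to the PCA.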

2.7 eXtreme gradient boosting

eXtreme Gradient Boosting, abbreviated as XGBoost, is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework, proposed by Chen and Guestrin [15] in 2015. It is frequently applied to classification and regression problems owing to its speed, efficiency and scalability [30]. In gradient boosting, the prediction for the ith row is the sum of the outputs of all the trees fitted so far. Suppose there are K boosted trees; mathematically, the XGBoost model takes the form:

$$\begin{aligned} \displaystyle {\hat{y}}_i=\sum _{k=1}^K f_k(x_i),\,\, f_k\in F, \end{aligned}$$

where \({\hat{y}}_i\) is the value predicted by the ML model for the ith row, K is the number of boosted trees, \(x_i\) is the ith data point (a vector whose entries are the columns of the ith row), \(f_k\) is a function in the functional space F, and F is the set of all possible Classification And Regression Trees (CARTs). Similarly, the prediction after the kth boosted tree is defined as:

$$\begin{aligned} \displaystyle {\hat{y}}_{i}^{(k)}=\sum _{j=1}^k f_j (x_i). \end{aligned}$$
(2)
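To illustrate the additive form of the prediction, the following Python sketch fits a sequence of one-split regression trees (stumps) to the current residuals and accumulates their outputs. It is plain gradient boosting with squared-error loss on one made-up feature, not the regularized XGBoost procedure itself; the function names, the toy data and the use of a shrinkage factor `eta` (the learning rate) are assumptions made for illustration.

```python
def fit_stump(x, residual):
    """Least-squares regression stump on a single numeric feature."""
    best = None
    # Candidate thresholds exclude the maximum so both sides are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        w_left = sum(left) / len(left)
        w_right = sum(right) / len(right)
        sse = (sum((r - w_left) ** 2 for r in left)
               + sum((r - w_right) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, w_left, w_right)
    _, threshold, w_left, w_right = best
    return lambda xi: w_left if xi <= threshold else w_right


def boost(x, y, n_trees=20, eta=0.3):
    """Additive training: each new tree is added to the running prediction."""
    y_hat = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, y_hat)]
        f_k = fit_stump(x, residual)
        trees.append(f_k)
        y_hat = [pi + eta * f_k(xi) for pi, xi in zip(y_hat, x)]
    return trees, y_hat


def predict(trees, xi, eta=0.3):
    """Prediction is the (shrunken) sum of the outputs of all fitted trees."""
    return sum(eta * f(xi) for f in trees)


# Made-up data: one predictor and a life-expectancy-like response.
x = [1, 2, 3, 4, 5, 6]
y = [60.0, 62.0, 65.0, 70.0, 76.0, 80.0]
trees, y_hat = boost(x, y)
```

Calling `predict(trees, xi)` reproduces the running prediction built during training, which is exactly the sum-over-trees structure of the model.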

The trees are trained by defining an objective function and optimizing it. The objective function to be optimized for the kth boosted tree is given by:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N L(y_i, {\hat{y}}_{i}^{(k)})+\sum _{j=1}^k\Omega (f_j), \end{aligned}$$
(3)

where \(L(y_i, {\hat{y}}_{i}^{(k)})\) is the training loss function (here, the squared error) evaluated at the prediction after the kth boosted tree and \(\Omega (f_j)\) is the regularization term of the jth tree, a penalty term to prevent over-fitting [16]. The loss and regularization functions are derived based on [15] to come up with the learning objective function. The general architecture of the XGBoost algorithm is shown in Fig. 1.

Fig. 1 A general architecture of the XGBoost algorithm

Since gradient-boosted trees sum the predictions of the previous trees, in addition to the prediction of the new tree, it follows from (2) that:

$$\begin{aligned} \displaystyle {\hat{y}}_{i}^{(k)}={\hat{y}}_{i}^{(k-1)}+ f_k(x_i), \end{aligned}$$
(4)

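which is the idea behind additive training. Substituting (4) into the preceding learning objective in (3), and dropping the regularization terms of the first \(k-1\) trees (which are already fixed at step k and are therefore constant), we obtain: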

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N L \left( y_i, {\hat{y}}_{i}^{(k-1)}+f_k(x_i)\right) +\Omega (f_k). \end{aligned}$$

For the squared-error loss, the above equation can be re-written as follows:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ y_i-({\hat{y}}_{i}^{(k-1)}+f_k(x_i))\right] ^2+\Omega (f_k). \end{aligned}$$

Multiplying the polynomial out, the objective function becomes:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ 2 \left( {\hat{y}}_{i}^{(k-1)}-y_i\right) f_k(x_i)+f_k(x_i)^2\right] +\Omega (f_k)+C, \end{aligned}$$

where C is a constant term independent of \(f_k\). The goal is to find the function \(f_k\), mapping the samples (at the root) to the predictions (at the leaves), that minimizes \(Obj^{(k)}\). For a general loss function, XGBoost uses a second-order Taylor expansion of the loss around \({\hat{y}}_{i}^{(k-1)}\), in the spirit of the Newton–Raphson method, which, after dropping constant terms, gives:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ g_if_k(x_i)+\frac{1}{2}h_if_k(x_i)^2\right] +\Omega (f_k), \end{aligned}$$

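where \(g_i\) and \(h_i\) are the first and second partial derivatives of the loss function with respect to the prediction, respectively. For the squared-error loss, \(g_i=2({\hat{y}}_i-y_i)\) and \(h_i=2\). For the Bernoulli log-likelihood \(l(y_i,{\hat{y}}_i)=y_i\log {\hat{y}}_i+(1-y_i)\log (1-{\hat{y}}_i)\), they are: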

$$\begin{aligned} g_i&=\frac{\partial l(y_i, {\hat{y}}_{i})}{\partial {\hat{y}}_i}=\frac{y_i}{{\hat{y}}_i}-\frac{1-y_i}{1- {\hat{y}}_i}=\frac{y_i(1-{\hat{y}}_i)-{\hat{y}}_i(1-y_i)}{{\hat{y}}_i(1-{\hat{y}}_i)}\\ &= \frac{y_i-y_i {\hat{y}}_i-{\hat{y}}_i+y_i {\hat{y}}_i}{{\hat{y}}_i(1-{\hat{y}}_i)} = \frac{y_i-{\hat{y}}_i}{{\hat{y}}_i(1-{\hat{y}}_i)},\\ h_i&=\frac{\partial ^2 l(y_i,{\hat{y}}_i)}{\partial {\hat{y}}_i^2}=\frac{\partial }{\partial {\hat{y}}_i}g_i=\frac{\partial }{\partial {\hat{y}}_i}\left[ \frac{y_i-{\hat{y}}_i}{{\hat{y}}_i(1-{\hat{y}}_i)}\right] \\&=\frac{y_i-1}{{({\hat{y}}_i-1)}^2}-\frac{y_i}{{\hat{y}}_i^2}\,. \end{aligned}$$

Having introduced the training step, the complexity of the tree, \(\Omega (f_k)\), is now defined. Let w be the vector of leaf scores. Then \(f_k\), the function mapping a data point entering at the root of the tree to the score of the leaf it reaches, can be written in terms of w as follows:

$$\begin{aligned} \displaystyle f_k(x)=w_{q(x)}, \quad w\in R^T, \quad q:R^d\rightarrow \,\left\{ 1,2,\ldots ,T\right\} , \end{aligned}$$

where w is the vector of scores on leaves, q is the function assigning each data point to the corresponding leaf, and T is the number of leaves. In XGBoost, the regularization term is defined as shown in (5), where \(\gamma \) and \(\lambda \) are penalty constants to reduce overfitting [16]:

$$\begin{aligned} \displaystyle \Omega (f_k)=\gamma T +\frac{1}{2}\lambda \sum _{j=1}^T w_j^2. \end{aligned}$$
(5)

Combining the loss function with the regularization term yields the learning objective function as:

$$\begin{aligned} \displaystyle Obj^{(k)}=\sum _{i=1}^N \left[ g_i w_{q(x_i)}+\frac{1}{2}h_i w_{q(x_i)}^2\right] +\gamma T +\frac{1}{2}\lambda \sum _{j=1}^T w_j^2, \end{aligned}$$

which is the quantity XGBoost uses to assess how good a candidate tree, with its structure and leaf scores, is at step k.
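As a hedged numerical sketch of the objective above, the Python snippet below evaluates it for a fixed two-leaf tree using squared-error gradient statistics. The closed-form leaf score \(w_j=-G_j/(H_j+\lambda )\), where \(G_j\) and \(H_j\) are the sums of \(g_i\) and \(h_i\) over the points in leaf j, is the standard minimizer of this quadratic objective and is not derived in the text; the function names, the zero-based leaf indices and the toy data are assumptions.

```python
def squared_error_grad_hess(y, y_prev):
    """g_i and h_i of (y_i - y_hat_i)^2 with respect to the prediction."""
    g = [2.0 * (p - t) for t, p in zip(y, y_prev)]
    h = [2.0] * len(y)
    return g, h


def objective(g, h, leaf_of, w, gamma=0.0, lam=1.0):
    """sum_i [g_i w_q(i) + 0.5 h_i w_q(i)^2] + gamma*T + 0.5*lam*sum_j w_j^2."""
    n_leaves = len(w)
    loss = sum(gi * w[q] + 0.5 * hi * w[q] ** 2
               for gi, hi, q in zip(g, h, leaf_of))
    return loss + gamma * n_leaves + 0.5 * lam * sum(wj ** 2 for wj in w)


def best_leaf_weights(g, h, leaf_of, n_leaves, lam=1.0):
    """Leaf scores minimizing the objective for a fixed tree structure."""
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for gi, hi, q in zip(g, h, leaf_of):
        G[q] += gi
        H[q] += hi
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]


# Made-up data: four points, previous prediction 65 for all, two leaves.
y = [60.0, 62.0, 70.0, 72.0]
y_prev = [65.0, 65.0, 65.0, 65.0]
leaf_of = [0, 0, 1, 1]  # q(x_i), zero-based here for convenience
g, h = squared_error_grad_hess(y, y_prev)
w_star = best_leaf_weights(g, h, leaf_of, n_leaves=2)
obj_star = objective(g, h, leaf_of, w_star)
```

Perturbing either entry of `w_star` increases the objective, which is a quick way to check that the leaf scores are optimal for the given structure and penalty constants.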

3 Results

3.1 Overall findings

Results from the Extended Multivariate Kruskal–Wallis (E-MKW) rank sum test revealed significant differences in the mean vectors of the numeric variables across the regions (EM Kruskal–Wallis chi-squared = 2615.833, df = 220.4847, p-value ≈ 0). Similarly, there were significant differences among the variables across the three income groups (EM Kruskal–Wallis chi-squared = 1128.493, df = 73.49491, p-value = 1.448515e−188). The populations of the richest countries in the world have life expectancies of over 60 years, with the North American countries taking the lead at an average life expectancy of 79.88 years. The countries in the Sub-Saharan Africa region have the lowest life expectancy, with a mean of 57.12 years. The respective regional mean life expectancies are given in Table 4.

Table 4 Average life expectancy rates in years per region

3.2 Key determinants of life expectancy

Fig. 2a visualizes the contribution of study variables to the first principal component. Principal component one accounted for 32.6% of the variability in the dataset. The anticipated average contribution of the variables is shown by the red dashed line (6.25%).

Fig. 2 Contribution of study variables relative to PC1 and the whole model

If the research variables contributed equally, the expected average value depicted as the cutoff in Fig. 2a would be arrived at as shown in (6). A variable is deemed essential for contributing to a specific principal component if its contribution exceeds this threshold [31].

$$\begin{aligned} \displaystyle \text {Expected Average Contribution}\,(\%)&=\frac{1}{\text {No. of Variables}}\times 100\nonumber \\&=\frac{1}{16} \times 100=6.25\%. \end{aligned}$$
(6)
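The selection rule in (6) amounts to keeping the variables whose contribution exceeds the uniform share of 100% divided by the number of variables. A minimal Python sketch follows; the variable names and contribution values are made up for illustration and do not reproduce the study's PCA output.

```python
def important_variables(contributions):
    """Keep variables whose contribution (%) exceeds the uniform share."""
    threshold = 100.0 / len(contributions)
    selected = [name for name, c in contributions.items() if c > threshold]
    return threshold, selected


# Hypothetical contributions (%) of 16 variables to PC1; they sum to 100.
contributions = {f"var_{i}": 12.0 for i in range(1, 7)}
contributions.update({f"var_{i}": 2.8 for i in range(7, 17)})
threshold, selected = important_variables(contributions)
```

With 16 variables the threshold evaluates to 6.25, matching the cutoff stated in (6).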

From Fig. 2a, the percent of thinness among children aged 5–9 and 10–19, number of years at school, average body mass index (BMI), number of under-five deaths and number of infant deaths per 1000 population were important in contributing to the first principal component. These six variables exceed the threshold of 6.25% and are hence considered important. The six variables, together with region and income group, were chosen as the research variables for further analysis. The importance of the variables in the final model, based on the gain measure, is depicted in Fig. 2b. In decreasing order of importance, regional location, number of years at school, income group, number of under-five deaths per 1000 population, percent of thinness among children aged 5–9 and the average BMI are the key determinants of life expectancy.

3.3 Association of life expectancy and its selected predictors

Fig. 3 visualizes the Spearman rank correlation matrix heatmap between life expectancy and the six significant predictors. Percent of thinness among children aged 5–9 and 10–19, the number of under five deaths, and the number of infant deaths are negatively associated with LE. The number of years spent in school and the average BMI are positively correlated with LE. The percent of thinness among children aged 5–9 (\(t = -26.701\), df = 2654, p-value < 2.2e−16) and 10–19 (\(t = -27.092\), df = 2654, p-value < 2.2e−16) were significantly correlated with LE.

Fig. 3 A correlation matrix heatmap of the selected numeric predictors with LE

Similarly, from Fig. 3, the number of years at school and life expectancy were significantly correlated (\(t = 58.709\), df = 2654, p-value < 2.2e−16). Furthermore, the average BMI (\(t = 35.733\), df = 2654, p-value < 2.2e−16), number of under-five deaths (\(t = -10.72\), df = 2654, p-value < 2.2e−16) and number of infant deaths (\(t = -9.3523\), df = 2654, p-value < 2.2e−16) per 1000 population were found to have a significant correlation with life expectancy. It is evident from Fig. 4a that life expectancy improves as the percent of thinness among children aged 5–9 years reduces.
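For readers who wish to reproduce this kind of test, the following Python sketch computes a Spearman rank correlation and the common t approximation \(t=\rho \sqrt{(n-2)/(1-\rho ^2)}\) with \(n-2\) degrees of freedom. Whether the reported t statistics were obtained in exactly this way is an assumption; the function names and the six toy observations are illustrative only.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = average_rank
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_t(x, y):
    """Spearman's rho, its t approximation and the degrees of freedom."""
    n = len(x)
    rho = pearson(ranks(x), ranks(y))
    t = rho * math.sqrt((n - 2) / (1 - rho ** 2))  # undefined when |rho| = 1
    return rho, t, n - 2


# Made-up data: thinness (%) against life expectancy (years).
thinness = [1, 2, 3, 4, 5, 6]
life_exp = [80, 75, 78, 60, 62, 55]
rho, t_stat, df = spearman_t(thinness, life_exp)
```

A negative `rho` and `t_stat` indicate that higher thinness goes with lower life expectancy in the toy data, mirroring the direction of the association reported above.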

Fig. 4 Scatter plots of the association between thinness and life expectancy

The scatter plot in Fig. 4b depicts a general trend of increasing life expectancy rates with a decrease in the percent of thinness among children aged 10–19 years. Low income countries seem to have higher percentages of thinness among 10–19 year olds in comparison to their middle and high income counterparts. From Fig. 5a, it is evident that life expectancy increases with the number of years spent in school across all the seven regions.

Fig. 5 Scatter plots of the association between no. of years at school, BMI & life expectancy

In high income countries, the majority of the population spends more years in school than in middle and low income countries. Figure 5b suggests that life expectancy increases with the average body mass index of the entire population. The majority of countries in Sub-Saharan Africa and South Asia have average BMI within the healthy and underweight ranges. Most of the populations in the Europe and Central Asia, Latin America and Caribbean, and Middle East and North Africa regions are obese.

Notably from Fig. 6a, life expectancy improves as the number of under five deaths per 1000 population decreases. High income countries appear to have the lowest numbers of under-five deaths. Low income countries in the Sub-Saharan Africa and South Asian regions have higher numbers of under-five deaths.

Fig. 6 Scatter plots of the association between child mortality and life expectancy

It can further be observed from Fig. 6b that life expectancy increases with a decrease in the number of infant deaths per 1000 population. High income countries seem to experience the lowest infant mortality across all the regions. On the other hand, low income countries in the Sub-Saharan Africa and South Asia regions, and some of the middle income countries in East Asia and Pacific, have higher infant mortality. As a general observation, life expectancy seems to be higher in the high income countries and lower in the low income countries across all the regions of the world.

3.4 Model performance results

The performance of the XGBoost model was compared with that of two other prominent machine learning models, Random Forest (RF) and Artificial Neural Networks (ANN), as shown in Table 5. For the RF model, the grid search optimization technique was used to test 10 candidate settings for each of the following hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf, and max_features. The ANN model was defined with one hidden layer of 20 neurons, an input layer taking 8 predictor variables, and an output layer of one neuron. The model was compiled using mean squared error (MSE) as the loss function, root mean square propagation (RMSprop) as the optimizer with a learning rate of 0.005, and mean absolute error (MAE) as the performance metric. The model was trained for 100 epochs with a batch size of 32 and a validation split of 20%.

RMSprop was used as the optimizer since it is known to improve the convergence rate and generalization performance of the model. The XGBoost model outperformed both the ANN and RF models. The XGBoost and RF models were efficient in terms of model training run-time, whereas the ANN model performed poorly on this front.

Table 5 RF, ANN & XGBoost models’ predictive performance results

Figs. 7a and 7b compare predicted and actual values in terms of life expectancy and mean life expectancy for the year 2015, respectively, and show that the predictions deviate only slightly from the actual values.

Fig. 7 Actual vs. predicted life expectancy rates for the year 2015

4 Discussion

The XGBoost algorithm was used to build the model on 80% of the dataset. The default hyperparameter values were used to create the baseline model. The gbtree booster was used, with the following tree booster parameter values: 0.3 for the learning rate (eta), 6 for the maximum tree depth (max_depth), 0 for gamma, 1 for the minimum child weight (min_child_weight), 1 for the row subsample ratio (subsample), and 1 for the column subsample ratio (colsample_bytree). Regression with squared loss (reg:squarederror) was used as the learning objective. A total of 100 trees were included in the model. The grid search optimization method was then used to test 10 candidate settings for each of the following XGBoost hyperparameters: eta, gamma, max_depth, min_child_weight, subsample, colsample_bytree, and nrounds.

After testing all possible combinations of these hyperparameters, the set that resulted in the lowest prediction error on the validation set was: eta = 0.3, gamma = 0, max_depth = 7, min_child_weight = 1, subsample = 1, colsample_bytree = 1, and nrounds = 500. These hyperparameters were then used to train the final XGBoost model in this study. The use of these optimized hyperparameters is believed to have improved the accuracy and reliability of the results. A 10-fold cross-validation with two partitioning repeats yielded the best bias-variance trade-off. For model training efficiency, 6 cores were used to run cross-validation in parallel. The remaining 20% of the dataset was used to evaluate the model. The model accuracy was assessed using RMSE and MAE.
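The mechanics of an exhaustive grid search combined with repeated k-fold cross-validation can be sketched as below. This Python sketch only enumerates the hyperparameter combinations and the train/validation index splits; it does not train any model and is not the caret workflow used in the study. The candidate values other than those reported above, the function names and the seed are assumptions.

```python
import itertools
import random


def parameter_grid(space):
    """Yield every combination of the candidate values as a dict."""
    names = sorted(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


def repeated_kfold(n_rows, k=10, repeats=2, seed=1):
    """Yield (train_indices, validation_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # a fresh partition for every repeat
        folds = [indices[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            yield train, held_out


# Small hypothetical grid; 0.3, 7 and 500 are the values reported in the text.
space = {"eta": [0.1, 0.3], "max_depth": [6, 7], "nrounds": [100, 500]}
grid = list(parameter_grid(space))
splits = list(repeated_kfold(50))
```

Each of the 8 combinations in `grid` would be scored on all 20 splits and the combination with the lowest average validation error retained. Since the grid size is the product of the numbers of candidate values, a full grid over many hyperparameters grows quickly.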

Regional location, number of years at school, income group, number of under-five fatalities per 1000 population, and the percent of thinness among children aged 5–9 years were shown to be the major drivers of life expectancy in this study. These findings are consistent with earlier research by Kaplan et al. [32] and Luy et al. [33] who established that rising educational levels were a determinant of rising life expectancy. According to Szwarcwald et al. [34], life expectancy at birth for women and men living in the wealthiest regions was 5 years higher than for those living in the poorest regions. Nestorovska and Levkov [35] discovered that gross national income had a favorable and statistically significant influence on life expectancy. Miladinov [36] established that a country’s population health and socioeconomic development had a significant impact on life expectancy at birth.

Across all regions, the current study finds that life expectancy is greater in high-income nations and lower in low-income ones. These findings are explained by the fact that nations in the richest geographical areas, such as North America, have longer life expectancies. This is due to increased economic growth in such regions, as a result of better health care, social well-being, industrialization, and educational attainment levels. Across all seven regions, life expectancy rises as the number of years spent in school increases. Compared with middle and low income nations, high income countries have a majority of their inhabitants spending more years in school. A high level of literacy allows people to make better life decisions and gives them, among other things, more job prospects, more negotiating power over remuneration, and healthier food and lifestyle habits. This elevates the standard of living for citizens in a country.

As a metric of economic development, improved GDP per capita boosts life expectancy at birth by promoting economic growth and development in a country, resulting in a longer lifespan. High-income nations outperform their middle and low-income peers in terms of economic growth and development. This is due to increasing economic growth, higher living standards, and improved health in the first world countries. Life expectancy rises when the number of under-five deaths per 1000 people drops. High-income nations tend to have the fewest deaths among children under the age of five. This is due to increased access to better healthcare in advanced economies, particularly prenatal and postnatal care, as well as better nutrition. As a result, the risk of death for children under the age of five is quite low in these nations. Low-income nations, on the other hand, have a higher rate of mortality among children under the age of five. The highest rates are observed in Sub-Saharan Africa and South Asia. This is due to low-income nations’ lag in terms of economic growth, employment rates, access to better healthcare, improved sanitation, and overall living standards.

As the percentage of thinness among children aged 5–9 years decreases, life expectancy increases. High-income countries have lower percentages of thinness in this cohort, with North American countries having the lowest. Low-income countries, on the other hand, continue to have higher rates of childhood thinness, with Sub-Saharan Africa and South Asia topping the list. Thinness is connected to medical, societal, and economic difficulties [37]. Poor nutrition, caused by insufficient food and beverage consumption, a lack of available food and drink, chronic famine, and food insecurity, is a key contributor to thinness, particularly in Sub-Saharan Africa and South Asia. The thinness that does occur in high-income countries may be explained by the ambition of young girls to achieve the ideal of thin beauty promoted by the beauty and modelling industry [38].

Pisal et al. [39] aim to predict life expectancy in the Asian population using tree classifier models, namely J48, Random Tree, and Random Forest. In contrast, this paper employs an XGBoost regressor to estimate life expectancy globally. Pisal et al. [39] report that the Random Forest model achieves the highest accuracy, 84%, under 10-fold cross-validation. Their study also identifies significant predictors of life expectancy, including socioeconomic factors, educational status, health conditions, and infectious diseases. Chen and Cheng [40] proposed a linear mean residual life model and developed inference procedures for handling potential censoring. They conducted simulations to assess the finite sample properties of the proposed methods. However, the paper gives few details illustrating the efficiency of the proposed approaches. While the linear mean residual life model is a valuable tool in certain contexts, it may not always be appropriate for the data at hand.

While the study by Shang [41] presents useful insights into the prediction of life expectancy using model averaging approaches, it does not report the actual values of the model evaluation metrics used, and its comparative investigation is limited to principal component approaches and univariate methods. The study also highlights the challenges of accurately predicting life expectancy, especially across different demographic groups and regions. Raftery et al. [11] propose a Bayesian hierarchical model (BHM) to probabilistically project life expectancy for all countries worldwide and compare its performance with the UN methodology. The BHM achieved a mean absolute error (MAE) of 1.07 and a standard absolute prediction error (SAPE) of 1.04, outperforming the UN approach, which reported an MAE of 1.86 for 10-year out-of-sample predictions. However, this study has a number of limitations. Firstly, it excludes countries with a generalized HIV/AIDS epidemic, potentially affecting the generalizability of the findings. Secondly, the analysis focuses solely on forecasting life expectancy for males, limiting the ability to assess the model’s robustness. Additionally, the Bayesian approach employed requires a prior distribution, which can be a disadvantage when external information is scarce.

Dias et al. [42] investigated the impact of sex, cause of death, profession, and race on life expectancy in the Colombo district of Sri Lanka using generalized linear models (GLMs) and Kaplan-Meier estimates. The authors found that a univariate GLM could predict an individual’s lifetime, and that race, cause of death, and profession had no effect on lifespan. However, the corrected GLM fit poorly, with low \(R^2\) and adjusted \(R^2\) values indicating underfitting. The study was limited by its narrow set of predictors and by the need for a more robust modeling technique. In contrast, the current study aimed to broaden the scope of predictors and to improve both the accuracy and the generalizability of life expectancy predictions by incorporating other relevant factors and employing a robust algorithm.

The present study demonstrates the competitiveness of the XGBoost algorithm in terms of its ability to handle missing data, its faster model building, and its accuracy gains over other comparable ensemble algorithms applied to life expectancy datasets. On the test set, the XGBoost model attained MAE and RMSE values of 1.554 and 2.402, respectively. The optimal number of boosting iterations was 500, with a learning rate of 0.3 and a maximum tree depth of 7. These results improve on those reported by Meshram [12], who used a random forest regression model and achieved an MAE of 1.58. These results support the XGBoost regressor as a reliable and efficient model for estimating life expectancy across the globe.
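To make the two error metrics reported above concrete, the following minimal sketch computes MAE and RMSE for a small set of values. The numbers and function names are hypothetical and chosen only for illustration; they are not drawn from the study's data or code.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical life expectancy values in years (illustration only).
actual = [70.0, 65.0, 80.0, 55.0]
predicted = [71.0, 64.0, 79.0, 61.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE and grows faster when a few predictions are far off. A gap between the two, as in the reported 1.554 and 2.402, may therefore indicate that a minority of predictions carry comparatively large errors.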

5 Conclusion

This study recognizes that, beyond the variables employed in this work, other measures of health and environmental conditions contribute to a population’s lifespan. Quality-adjusted life years (QALYs) and disability-adjusted life years (DALYs), as well as pollution and climate change, have an influence on life expectancy. However, their inclusion in the model was beyond the scope of this study. In addition, data for all the factors included in this analysis were not easily obtainable for the years 2016 through 2021. As a result, the prediction model does not account for recent events that may have affected life expectancy in the UN member nations.

The results presented in this paper are by no means a definitive representation of life expectancy. Despite the shortcomings highlighted above, the work presents some interesting findings that are comparable to research done by others. Moreover, the results of the model can assist in the design of better life expectancy prediction strategies, which can in turn support the improvement of people’s well-being. Future studies could integrate other quality-of-life measures (such as QALYs and DALYs) and environmental components into the prediction model, and could update the model with more recent data.